{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Sets Tutorial\n", "This tutorial demonstrates how to create and use `DataSet` objects. PyGSTi uses `DataSet` objects to hold experimental or simulated data in the form of outcome counts. When a `DataSet` is used to hold time-independent data (the typical case, and all we'll look at in this tutorial), it essentially looks like a nested dictionary which associates operation sequences with dictionaries of (outcome-label,count) pairs so that `dataset[circuit][outcomeLabel]` can be used to read & write the number of `outcomeLabel` outcomes of the experiment given by the circuit `circuit`.\n", "\n", "There are a few important differences between a `DataSet` and a dictionary-of-dictionaries:\n", "- `DataSet` objects can be in one of two modes: *static* or *non-static*. When in *non-static* mode, data can be freely modified within the set, making this mode to use during the data-entry. In the *static* mode, data cannot be modified and the `DataSet` is essentially read-only. The `done_adding_data` method of a `DataSet` switches from non-static to static mode, and should be called, as the name implies, once all desired data has been added (or modified). Once a `DataSet` is static, it is read-only for the rest of its life; to modify its data the best one can do is make a non-static *copy* via the `copy_nonstatic` member and modify the copy.\n", "\n", "- Because `DataSet`s may contain time-dependent data, the dictionary-access syntax for a single outcome label (i.e. `dataset[circuit][outcomeLabel]`) *cannot* be used to write counts for new `circuit` keys; One should instead use the `add_`*xxx* methods of the `DataSet` object.\n", "\n", "Once a `DataSet` is constructed, filled with data, and made *static*, it is typically passed as a parameter to one of pyGSTi's algorithm or driver routines to find a `Model` estimate based on the data. This tutorial focuses on how to construct a `DataSet` and modify its data. Later tutorials will demonstrate the different GST algorithms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pygsti" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a `DataSet`\n", "There three basic ways to create `DataSet` objects in `pygsti`:\n", "* By creating an empty `DataSet` object and manually adding counts corresponding to operation sequences. Remember that the `add_`*xxx* methods must be used to add data for operation sequences not yet in the `DataSet`. Once the data is added, be sure to call `done_adding_data`, as this restructures the internal storage of the `DataSet` to optimize the access operations used by algorithms.\n", "* By loading from a text-format dataset file via `pygsti.io.load_dataset`. The result is a ready-to-use-in-algorithms *static* `DataSet`, so there's no need to call `done_adding_data` this time.\n", "* By using a `Model` to generate \"fake\" data via `generate_fake_data`. This can be useful for doing simulations of GST, and comparing to your experimental results.\n", "\n", "We do each of these in turn in the cells below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#1) Creating a data set from scratch\n", "# Note that tuples may be used in lieu of OpString objects\n", "ds1 = pygsti.objects.DataSet(outcome_labels=['0','1'])\n", "ds1.add_count_dict( ('Gx',), {'0': 10, '1': 90} )\n", "ds1.add_count_dict( ('Gx','Gy'), {'0': 40, '1': 60} )\n", "ds1[('Gy',)] = {'0': 10, '1': 90} # dictionary assignment\n", "\n", "#Modify existing data using dictionary-like access\n", "ds1[('Gx',)]['0'] = 15\n", "ds1[('Gx',)]['1'] = 85\n", "\n", "#OpString objects can be used.\n", "mdl = pygsti.objects.Circuit( ('Gx','Gy'))\n", "ds1[mdl]['0'] = 45\n", "ds1[mdl]['1'] = 55\n", "\n", "ds1.done_adding_data()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#2) By creating and loading a text-format dataset file. The first\n", "# row is a directive which specifies what the columns (after the\n", "# first one) holds. Note that \"0\" and \"1\" in are the \n", "# SPAM labels and must match those of any Model used in \n", "# conjuction with this DataSet.\n", "dataset_txt = \\\n", "\"\"\"## Columns = 0 count, 1 count\n", "{}@(0) 0 100\n", "Gxpi2:0@(0) 10 90\n", "Gxpi2:0Gypi2:0@(0) 40 60\n", "Gxpi2:0^4@(0) 20 90\n", "\"\"\"\n", "with open(\"../tutorial_files/Example_TinyDataset.txt\",\"w\") as tinydataset:\n", " tinydataset.write(dataset_txt)\n", "ds2 = pygsti.io.load_dataset(\"../tutorial_files/Example_TinyDataset.txt\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#3) By generating fake data (using the std1Q_XYI standard model module)\n", "from pygsti.modelpacks import smq1Q_XYI\n", "\n", "#Depolarize the perfect X,Y,I model\n", "depol_gateset = smq1Q_XYI.target_model().depolarize(op_noise=0.1)\n", "\n", "#Compute the sequences needed to perform Long Sequence GST on \n", "# this Model with sequences up to lenth 128\n", "exp_design = smq1Q_XYI.get_gst_experiment_design(max_max_length=128)\n", "circuit_list = exp_design.all_circuits_needing_data\n", "\n", "#Generate fake data (Tutorial 00)\n", "ds3 = pygsti.construction.simulate_data(depol_gateset, circuit_list, num_samples=1000,\n", " sample_error='binomial', seed=100)\n", "ds3b = pygsti.construction.simulate_data(depol_gateset, circuit_list, num_samples=50,\n", " sample_error='binomial', seed=100)\n", "\n", "#Package the ds3 and ds3b datasets together with their experiment design\n", "# and save to disk for later tutorials to use for protocols\n", "pygsti.protocols.ProtocolData(exp_design, ds3).write(\"../tutorial_files/Example_GST_Data\")\n", "pygsti.protocols.ProtocolData(exp_design, ds3b).write(\"../tutorial_files/Example_GST_Data_LowCnts\")\n", "\n", "#Also write the dataset files separately\n", "pygsti.io.write_dataset(\"../tutorial_files/Example_Dataset.txt\", ds3, outcome_label_order=['0','1']) \n", "pygsti.io.write_dataset(\"../tutorial_files/Example_Dataset_LowCnts.txt\", ds3b) " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Viewing `DataSets`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#It's easy to just print them:\n", "print(\"Dataset1:\\n\",ds1)\n", "print(\"Dataset2:\\n\",ds2)\n", "print(\"Dataset3 is too big to print, so here it is truncated to Dataset2's strings\\n\", ds3.truncate(ds2.keys()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that the outcome labels `'0'` and `'1'` appear as `('0',)` and 
, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that the outcome labels `'0'` and `'1'` appear as `('0',)` and `('1',)`**. This is because outcome labels in pyGSTi are tuples of time-ordered instrument-element labels (see the [intermediate measurements tutorial](advanced/Instruments.ipynb)) and POVM effect labels. In the special but common case when there are no intermediate measurements, the outcome label is a 1-tuple of just the final POVM effect label. In this case, one may use the effect label itself (e.g. `'0'` or `'1'`) in place of the 1-tuple in almost all contexts, as it is automatically converted to the 1-tuple (e.g. `('0',)` or `('1',)`) internally. When printing, however, the 1-tuple is still displayed to remind the user of the more general structure contained in the `DataSet`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iteration over data sets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A DataSet's keys() method returns the Circuit objects it contains\n", "ds1.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# There are many ways to iterate over a DataSet. Here's one:\n", "for circuit in ds1.keys():\n", "    dsRow = ds1[circuit]\n", "    for outcome_label in dsRow.counts.keys():\n", "        print(\"Circuit = %s, Outcome label = %s, count = %d\" % \\\n", "              (repr(circuit).ljust(13), str(outcome_label).ljust(7), dsRow[outcome_label]))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Advanced features of data sets\n", "\n", "### `collision_action` argument\n", "When creating a `DataSet` one may specify the `collision_action` argument as either `\"aggregate\"` (the default) or `\"keepseparate\"`. The former instructs the `DataSet` to simply add the counts of like outcomes when counts are added for an already existing gate sequence. `\"keepseparate\"`, on the other hand, causes the `DataSet` to tag added count data by appending a fictitious `\"#<n>\"` operation label to a gate sequence that already exists, where `<n>` is an integer. When retrieving the keys of a `keepseparate` data set, the `stripOccurrenceTags` argument to `keys()` determines whether the `\"#<n>\"` labels are included in the output (if they're not - the default - duplicate keys may be returned). Access to the different occurrences of the same data is provided via the `occurrence` argument of the `get_row` and `set_row` methods, which should be used instead of the usual bracket indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_agg = pygsti.objects.DataSet(outcome_labels=['0','1'], collision_action=\"aggregate\") #the default\n", "ds_agg.add_count_dict( ('Gx','Gy'), {'0': 10, '1': 90} )\n", "ds_agg.add_count_dict( ('Gx','Gy'), {'0': 40, '1': 60} )\n", "print(\"Aggregate-mode Dataset:\\n\",ds_agg)\n", "\n", "ds_sep = pygsti.objects.DataSet(outcome_labels=['0','1'], collision_action=\"keepseparate\")\n", "ds_sep.add_count_dict( ('Gx','Gy'), {'0': 10, '1': 90} )\n", "ds_sep.add_count_dict( ('Gx','Gy'), {'0': 40, '1': 60} )\n", "print(\"Keepseparate-mode Dataset:\\n\",ds_sep)\n" ] }
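, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying a static `DataSet`\n", "As noted in the introduction, a static `DataSet` (such as `ds1` after `done_adding_data` was called) is read-only; to change its data one modifies a non-static *copy*. The next cell is a minimal sketch of that workflow using `copy_nonstatic`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ds1 was made static above, so we edit a non-static copy of it instead\n", "# (a minimal sketch of the copy_nonstatic workflow described in the introduction)\n", "ds1_editable = ds1.copy_nonstatic()\n", "ds1_editable[('Gy',)]['0'] = 12   # existing circuits can be edited with dict-like access\n", "ds1_editable[('Gy',)]['1'] = 88\n", "ds1_editable.done_adding_data()   # make the copy static again before using it in algorithms\n", "print(ds1_editable)" ] }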
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Related topics\n", "This concludes our overview of `DataSet` objects. Here are a couple of related topics we didn't touch on that might be of interest:\n", "- the ability of a `DataSet` to store **time-dependent data** (in addition to the count data described above) is covered in the [timestamped data tutorial](advanced/TimestampedDataSets.ipynb).\n", "- the `MultiDataSet` object, which is similar to a dictionary of `DataSet`s, is described in the [MultiDataSet tutorial](advanced/MultiDataSet.ipynb). `MultiDataSet` objects are useful when you have several data sets containing outcomes for the *same* set of circuits, such as when you perform multiple passes of the same experimental procedure." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 1 }