{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2-Filtering\n",
    "This tutorial demonstrates how to filter PDB to create subsets of structures. For details see [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/filters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/filters)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import pyspark and mmtfPyspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "from mmtfPyspark.io import mmtfReader\n",
    "from mmtfPyspark.filters import ContainsGroup, ContainsLProteinChain, PolymerComposition, Resolution \n",
    "from mmtfPyspark.structureViewer import view_group_interaction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.appName(\"2-Filtering\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read PDB structures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9756"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = \"../resources/mmtf_reduced_sample\"\n",
    "pdb = mmtfReader.read_sequence_file(path).cache()\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter by Quality Metrics\n",
    "Structures can be filtered by [Resolution](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution) and [R-free](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/r-value-and-r-free). Each filter takes a minimum and maximum values. The example below returns structures with a resolution in the inclusive range [0.0, 1.5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2941"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(Resolution(0.0, 1.5))\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter by Polymer Chain Types\n",
    "A number of filters are available to filter by the type of the polymer chain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a subset of structures that contain at least one L-protein chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2912"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(ContainsLProteinChain())\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a subset of structure that exclusively contain L-protein chains (e.g., exclude protein-nucleic acid complexes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2788"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(ContainsLProteinChain(exclusive=True))\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Keep protein structures that exclusively contain chains made out of the 20 standard amino acids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2171"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20, exclusive=True))\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find the subset of structures that contains ATP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdb = pdb.filter(ContainsGroup(\"ATP\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize the hits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "70396abec8444ba6bf758eed78dcc8a3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=11), Output()),…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "view_group_interaction(pdb.keys().collect(),\"ATP\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter with a lambda expression\n",
    "Rather than using a pre-made filter, we can create simple filters using lambda expressions. The expression needs to evaluate to a boolean type.\n",
    "\n",
    "The variable t in the lambda expression below represents a tuple and t[1] is the second element in the tuple representing the mmtfStructure. \n",
    "\n",
    "Here, we filter by the number of atoms in an entry. You will learn more about extracting structural information from an mmtfStructure in future tutorials."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(lambda t: t[1].num_atoms < 500)\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, we can filter by the key, represented by the first element in a tuple: t[0].\n",
    "\n",
    "**Keys are case sensitive. Always use upper case PDB IDs in mmtf-pyspark!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb = pdb.filter(lambda t: t[0] in [\"4AFF\", \"4CBU\"])\n",
    "pdb.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}