{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Demonstration of PDBsum ligand interface data to dataframe script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This largely parallels the notebook [Working with PDBsum in Jupyter & Demonstration of PDBsum protein interface data to dataframe script](Working%20with%20PDBsum%20in%20Jupyter%20Basics.ipynb) except there, the data was protein-protein interaction list text. \n", "Here is is ligand and protein chain interaction list text that will be converted to a dataframe.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieving Ligand interface reports/ the list of interactions\n", "\n", "#### Getting list of involving a ligand under individual entries under PDBsum's 'Ligands' tab via command line.\n", "\n", "Say example from [here](http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1wsv&template=ligands.html&l=2.1) links to the following as 'List of\n", "interactions' in the bottom right of the page:\n", "\n", "```text \n", "http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl?pdb=1wsv&ligtype=02&ligno=01\n", "```\n", " \n", "Then based on suggestion at top [here](https://stackoverflow.com/a/52363117/8508004) that would be used in a curl command where the items after the `?` in the original URL get placed into quotes and provided following the `--data` flag argument option in the call to `curl`, like so:\n", "```text\n", "curl -L -o data.txt --data \"pdb=1wsv&ligtype=02&ligno=01\" http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl\n", "```\n", "\n", "**Specifically**, the `--data \"pdb=1wsv&ligtype=02&ligno=01\"` is the part coming from the end of the original URL.\n", "\n", "\n", "Putting that into action in Jupyter to fetch for the example the interactions list in a text:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 7082 0 7054 100 28 13697 54 --:--:-- --:--:-- --:--:-- 13751\n" ] } ], "source": [ "!curl -L -o data.txt --data \"pdb=1wsv&ligtype=02&ligno=01\" http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To prove that the data file has been retieved, we'll show the first 16 lines of it by running the next cell:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\r\n",
      "List of protein-ligand interactions\r\n",
      "-----------------------------------\r\n",
      "

\r\n", " PDB code: 1wsv Ligand THH\r\n", " ---------------------------\r\n", "

\r\n", "\r\n", "Hydrogen bonds\r\n", "--------------\r\n", "\r\n", " <----- A T O M 1 -----> <----- A T O M 2 ----->\r\n", "\r\n", " Atom Atom Res Res Atom Atom Res Res\r\n", " no. name name no. Chain no. name name no. Chain Distance\r\n", " 1. 670 N LEU 88 A --> 5701 OE2 THH 3001 A 3.08\r\n" ] } ], "source": [ "!head -16 data.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Later in this series of notebooks, I'll demonstrate how to make this step even easier with just the PDB entry id and the chains you are interested in and the later how to loop on this process to get multiple data files for interactions from different structures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making a Pandas dataframe from the interactions file\n", "\n", "To convert the data to a dataframe, we'll use a script.\n", "\n", " If you haven't encountered Pandas dataframes before I suggest you see the first two notebooks that come up with you launch a session from my [blast-binder](https://github.com/fomightez/blast-binder) site. Those first two notebooks cover using the dataframe containing BLAST results some. \n", " \n", "To get that script, you can run the next cell. (It is not included in the repository where this launches from to insure you always get the most current version, which is assumed to be the best available at the time.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 24999 100 24999 0 0 173k 0 --:--:-- --:--:-- --:--:-- 173k\n" ] } ], "source": [ "!curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/pdbsum_ligand_interactions_list_to_df.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the script now. And we already have a data file for it to process. To process the data file, run the next command where we use Python to run the script and direct it at the results file, `data.txt`, we made just a few cells ago." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Provided interactions data read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'ligand_int_pickled_df.pkl'" ] } ], "source": [ "%run pdbsum_ligand_interactions_list_to_df.py data.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of writing this, the script we are using outputs a file that is a binary, compact form of the dataframe. (That means it is tiny and not human readable. It is called 'pickled'. Saving in that form may seem odd, but as illustrated [here](#Output-to-more-universal,-table-like-formats) below this is is a very malleable form. And even more pertinent for dealing with data in Jupyter notebooks, there is actually an easier way to interact with this script when in Jupyter notebook that skips saving this intermediate file. So hang on through the long, more trandtional way of doing this before the easier way is introduced. And I saved it in the compact form and not the mroe typical tab-delimited form because we mostly won't go this route and might as well make tiny files while working along to a better route. 
It is easy to convert back and forth using the pickled form, assuming you can match the Pandas/Python versions.)\n", "\n", "We can take that file where the dataframe is pickled, and bring it into active memory in this notebook with another command from the Pandas library. First, we have to import the Pandas library.\n", "Run the next command to bring the dataframe into active memory. Note the file name comes from the name noted when we ran the script in the cell above." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_pickle(\"ligand_int_pickled_df.pkl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When that last cell ran, you won't have noticed any output, but something happened. We can look at that dataframe by calling it in a cell." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
0670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
1876OVAL115A5721N8THH3001A2.82Hydrogen bonds
21541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
31744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
42801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
.......................................
832800CZTYR371A5707O2THH3001A3.37Non-bonded contacts
842801OHTYR371A5704CATHH3001A3.87Non-bonded contacts
852801OHTYR371A5705CTTHH3001A3.60Non-bonded contacts
862801OHTYR371A5707O2THH3001A2.78Non-bonded contacts
872801OHTYR371A5710OTHH3001A3.62Non-bonded contacts
\n", "

88 rows × 12 columns

\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "0 670 N LEU 88 A 5701 \n", "1 876 O VAL 115 A 5721 \n", "2 1541 OE1 GLU 204 A 5731 \n", "3 1744 NH2 ARG 233 A 5728 \n", "4 2801 OH TYR 371 A 5707 \n", ".. ... ... ... ... ... ... \n", "83 2800 CZ TYR 371 A 5707 \n", "84 2801 OH TYR 371 A 5704 \n", "85 2801 OH TYR 371 A 5705 \n", "86 2801 OH TYR 371 A 5707 \n", "87 2801 OH TYR 371 A 5710 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 OE2 THH 3001 A 3.08 \n", "1 N8 THH 3001 A 2.82 \n", "2 NA2 THH 3001 A 2.45 \n", "3 O4 THH 3001 A 3.02 \n", "4 O2 THH 3001 A 2.78 \n", ".. ... ... ... ... ... \n", "83 O2 THH 3001 A 3.37 \n", "84 CA THH 3001 A 3.87 \n", "85 CT THH 3001 A 3.60 \n", "86 O2 THH 3001 A 2.78 \n", "87 O THH 3001 A 3.62 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds \n", ".. ... \n", "83 Non-bonded contacts \n", "84 Non-bonded contacts \n", "85 Non-bonded contacts \n", "86 Non-bonded contacts \n", "87 Non-bonded contacts \n", "\n", "[88 rows x 12 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that if the list of data is large, that the Jupyter environment represents just the head and tail to make it more reasonable. There are ways you can have Jupyter display it all which we won't go into here. \n", "\n", "Instead we'll start to show some methods of dataframes that make them convenient. For example, you can use the `head` method to see the start like we used on the command line above." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
0670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
1876OVAL115A5721N8THH3001A2.82Hydrogen bonds
21541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
31744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
42801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "0 670 N LEU 88 A 5701 \n", "1 876 O VAL 115 A 5721 \n", "2 1541 OE1 GLU 204 A 5731 \n", "3 1744 NH2 ARG 233 A 5728 \n", "4 2801 OH TYR 371 A 5707 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 OE2 THH 3001 A 3.08 \n", "1 N8 THH 3001 A 2.82 \n", "2 NA2 THH 3001 A 2.45 \n", "3 O4 THH 3001 A 3.02 \n", "4 O2 THH 3001 A 2.78 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now what types of interactions are observed for this ligand?\n", "\n", "To help answer that, we can group the results by the type column." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hydrogen bonds\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
0670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
1876OVAL115A5721N8THH3001A2.82Hydrogen bonds
21541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
31744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
42801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "0 670 N LEU 88 A 5701 \n", "1 876 O VAL 115 A 5721 \n", "2 1541 OE1 GLU 204 A 5731 \n", "3 1744 NH2 ARG 233 A 5728 \n", "4 2801 OH TYR 371 A 5707 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 OE2 THH 3001 A 3.08 \n", "1 N8 THH 3001 A 2.82 \n", "2 NA2 THH 3001 A 2.45 \n", "3 O4 THH 3001 A 3.02 \n", "4 O2 THH 3001 A 2.78 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Non-bonded contacts\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
5433SDMET56A5724C4ATHH3001A3.83Non-bonded contacts
6434CEMET56A5727C4THH3001A3.64Non-bonded contacts
7434CEMET56A5729N3THH3001A3.44Non-bonded contacts
8434CEMET56A5730C2THH3001A3.64Non-bonded contacts
9664CATHR87A5701OE2THH3001A3.79Non-bonded contacts
.......................................
832800CZTYR371A5707O2THH3001A3.37Non-bonded contacts
842801OHTYR371A5704CATHH3001A3.87Non-bonded contacts
852801OHTYR371A5705CTTHH3001A3.60Non-bonded contacts
862801OHTYR371A5707O2THH3001A2.78Non-bonded contacts
872801OHTYR371A5710OTHH3001A3.62Non-bonded contacts
\n", "

83 rows × 12 columns

\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "5 433 SD MET 56 A 5724 \n", "6 434 CE MET 56 A 5727 \n", "7 434 CE MET 56 A 5729 \n", "8 434 CE MET 56 A 5730 \n", "9 664 CA THR 87 A 5701 \n", ".. ... ... ... ... ... ... \n", "83 2800 CZ TYR 371 A 5707 \n", "84 2801 OH TYR 371 A 5704 \n", "85 2801 OH TYR 371 A 5705 \n", "86 2801 OH TYR 371 A 5707 \n", "87 2801 OH TYR 371 A 5710 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "5 C4A THH 3001 A 3.83 \n", "6 C4 THH 3001 A 3.64 \n", "7 N3 THH 3001 A 3.44 \n", "8 C2 THH 3001 A 3.64 \n", "9 OE2 THH 3001 A 3.79 \n", ".. ... ... ... ... ... \n", "83 O2 THH 3001 A 3.37 \n", "84 CA THH 3001 A 3.87 \n", "85 CT THH 3001 A 3.60 \n", "86 O2 THH 3001 A 2.78 \n", "87 O THH 3001 A 3.62 \n", "\n", " type \n", "5 Non-bonded contacts \n", "6 Non-bonded contacts \n", "7 Non-bonded contacts \n", "8 Non-bonded contacts \n", "9 Non-bonded contacts \n", ".. ... \n", "83 Non-bonded contacts \n", "84 Non-bonded contacts \n", "85 Non-bonded contacts \n", "86 Non-bonded contacts \n", "87 Non-bonded contacts \n", "\n", "[83 rows x 12 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "grouped = df.groupby('type')\n", "for type, grouped_df in grouped:\n", " print(type)\n", " display(grouped_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same data as earlier but we can cleary see we have Hydrogen bonds and Non-bonded contacts, and we immediately get a sense of what types of interactions are more abundant.\n", "\n", "What if we wanted to know the residues in chain A that interact with the ligand?\n", "\n", "We can easily make a list of the residues in the 'Atom1 Res no.' column. Lots will be repeated because that list is coming from all the atoms from each residue. To limit it to just showing a residue number once, no matter if it as a single or dozens of interactions wiht the ligand, we can use Python's set conversion of a list to limit it to the unique residues. The code in the next cell does that: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{56,\n", " 87,\n", " 88,\n", " 101,\n", " 102,\n", " 103,\n", " 115,\n", " 117,\n", " 176,\n", " 177,\n", " 196,\n", " 197,\n", " 204,\n", " 233,\n", " 242,\n", " 262,\n", " 371}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "the_list = df[\"Atom1 Res no.\"].tolist()\n", "residues_interacting_with_ligand = set(the_list)\n", "residues_interacting_with_ligand" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From that list, we can see what residues of chain A interact with the ligand.\n", "\n", "What if we wanted to know what residues are involved in hydrogen bonds? Then we can subset on the rows where `type` matches 'Hydrogen bonds' and look at those residues." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{88, 115, 204, 233, 371}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "the_hbond_list = df[df[\"type\"]==\"Hydrogen bonds\"][\"Atom1 Res no.\"].tolist()\n", "residues_interacting_with_ligand_via_hbond = set(the_hbond_list)\n", "residues_interacting_with_ligand_via_hbond" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may want to get a sense of what else you can do by examining the first two notebooks that come up with you launch a session from my [blast-binder](https://github.com/fomightez/blast-binder) site. Those first two notebooks cover using the dataframe containing BLAST results some.\n", "\n", "Shortly, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate; however, next I'll demonstrate how to save it as text for use elsewhere, such as in Excel." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output to more universal, table-like formats\n", "\n", "I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text based tables or even Excel. In fact, a text-based based table perhaps tab or comma-delimited would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be inportant if you are going to different versions of Python/Pandas as I briefly mentioned parenthetically above.\n", "\n", "**First, generating a text-based table.**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file\n", "df.to_csv('pdbsum_data.tsv', sep='\\t',index = False) #add `,header=False` to leave off header, too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `df.to_csv()` defaults to dealing with csv, you can simply use `df.to_csv('example.csv',index = False)` for comma-delimited (comma-separated) files.\n", "\n", "You can see that worked by looking at the first few lines with the next command. (Feel free to make the number higher or delete the number all together. I restricted it just to first line to make output smaller.)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Atom1 no.\tAtom1 name\tAtom1 Res name\tAtom1 Res no.\tAtom1 Chain\tAtom2 no.\tAtom2 name\tAtom2 Res name\tAtom2 Res no.\tAtom2 Chain\tDistance\ttype\r\n", "670\tN\tLEU\t88\tA\t5701\tOE2\tTHH\t3001\tA\t3.08\tHydrogen bonds\r\n", "876\tO\tVAL\t115\tA\t5721\tN8\tTHH\t3001\tA\t2.82\tHydrogen bonds\r\n", "1541\tOE1\tGLU\t204\tA\t5731\tNA2\tTHH\t3001\tA\t2.45\tHydrogen bonds\r\n", "1744\tNH2\tARG\t233\tA\t5728\tO4\tTHH\t3001\tA\t3.02\tHydrogen bonds\r\n" ] } ], "source": [ "!head -5 pdbsum_data.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you had need to go back from a tab-separated table to a dataframe, you can run something like in the following cell." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "reverted_df = pd.read_csv('pdbsum_data.tsv', sep='\\t')\n", "reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a comma-delimited (CSV) file you'd use `df = pd.read_csv('example.csv')` because `pd.read_csv()` method defaults to comma as the separator (`sep` parameter).\n", "\n", "You can verify that read from the text-based table by viewing it with the next line." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
0670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
1876OVAL115A5721N8THH3001A2.82Hydrogen bonds
21541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
31744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
42801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "0 670 N LEU 88 A 5701 \n", "1 876 O VAL 115 A 5721 \n", "2 1541 OE1 GLU 204 A 5731 \n", "3 1744 NH2 ARG 233 A 5728 \n", "4 2801 OH TYR 371 A 5707 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 OE2 THH 3001 A 3.08 \n", "1 N8 THH 3001 A 2.82 \n", "2 NA2 THH 3001 A 2.45 \n", "3 O4 THH 3001 A 3.02 \n", "4 O2 THH 3001 A 2.78 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reverted_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Generating an Excel spreadsheet from a dataframe.**\n", "\n", "Because this is a specialized need, there is a special module needed that I didn't bother installing by default and so it needs to be installed before generating the Excel file. Running the next cell will do both." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)\n", "Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install openpyxl\n", "# save to excel (KEEPS multiINDEX, and makes sparse to look good in Excel straight out of Python)\n", "df.to_excel('pdbsum_data.xlsx') # after openpyxl installed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll need to download the file first to your computer and then view it locally as there is no viewer in the Jupyter environment.\n", "\n", "Adiitionally, it is possible to add styles to dataframes and the styles such as shading of cells and coloring of text will be translated to the Excel document made as well.\n", "\n", "Excel files can be read in to Pandas dataframes directly without needing to go to a text based intermediate first." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# read Excel\n", "df_from_excel = pd.read_excel('pdbsum_data.xlsx',engine='openpyxl') # see https://stackoverflow.com/a/65266270/8508004 where notes xlrd no longer supports xlsx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That can be viewed to convince yourself it worked by running the next command." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
00670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
11876OVAL115A5721N8THH3001A2.82Hydrogen bonds
221541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
331744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
442801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
\n", "
" ], "text/plain": [ " Unnamed: 0 Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain \\\n", "0 0 670 N LEU 88 A \n", "1 1 876 O VAL 115 A \n", "2 2 1541 OE1 GLU 204 A \n", "3 3 1744 NH2 ARG 233 A \n", "4 4 2801 OH TYR 371 A \n", "\n", " Atom2 no. Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 5701 OE2 THH 3001 A 3.08 \n", "1 5721 N8 THH 3001 A 2.82 \n", "2 5731 NA2 THH 3001 A 2.45 \n", "3 5728 O4 THH 3001 A 3.02 \n", "4 5707 O2 THH 3001 A 2.78 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_excel.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate.\n", "\n", "----\n", "\n", "### Making a Pandas dataframe from the ligand interactions data file directly in Jupyter\n", "\n", "First we'll check for the script we'll use and get it if we don't already have it. \n", "\n", "(The thinking is once you know what you are doing you may have skipped all the steps above and not have the script you'll need yet. It cannot hurt to check and if it isn't present, bring it here.)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Get a file if not yet retrieved / check if file exists\n", "import os\n", "file_needed = \"pdbsum_ligand_interactions_list_to_df.py\"\n", "if not os.path.isfile(file_needed):\n", " !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is going to rely on approaches very similar to those illustrated [here](https://github.com/fomightez/patmatch-binder/blob/6f7630b2ee061079a72cd117127328fd1abfa6c7/notebooks/PatMatch%20with%20more%20Python.ipynb#Passing-results-data-into-active-memory-without-a-file-intermediate) and [here](https://github.com/fomightez/patmatch-binder/blob/6f7630b2ee061079a72cd117127328fd1abfa6c7/notebooks/Sending%20PatMatch%20output%20directly%20to%20Python.ipynb##Running-Patmatch-and-passing-the-results-to-Python-without-creating-an-output-file-intermediate).\n", "\n", "We obtained the `pdbsum_ligand_interactions_list_to_df.py` script in the preparation steps above. However, instead of using it as an external script as we did earlier in this notebook, we want to use the core function of that script within this notebook for the options that involve no pickled-object file intermediate. Similar to the way we imported a lot of other useful modules in the first notebook and a cell above, you can run the next cell to bring in to memory of this notebook's computational environment, the main function associated with the `pdbsum_prot_interactions_list_to_df.py` script, aptly named `pdbsum_ligand_interactions_list_to_df`. 
(As written below, the command to do that looks a bit redundant; however, the first part of the command, following `from`, actually references the `pdbsum_ligand_interactions_list_to_df.py` script; it just doesn't need the `.py` extension because the import statement only deals with such files.)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from pdbsum_ligand_interactions_list_to_df import pdbsum_ligand_interactions_list_to_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can demonstrate that worked by calling the function." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "pdbsum_ligand_interactions_list_to_df() missing 1 required positional argument: 'data_file'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_87/1310711332.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mpdbsum_ligand_interactions_list_to_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: pdbsum_ligand_interactions_list_to_df() missing 1 required positional argument: 'data_file'" ] } ], "source": [ "pdbsum_ligand_interactions_list_to_df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the module was not imported, you'd see `ModuleNotFoundError: No module named 'pdbsum_ligand_interactions_list_to_df'`, but instead you should see it saying it is missing `data_file` to act on because you passed it nothing.\n", "\n", "After importing the main function of that script into this running notebook, you are ready to demonstrate the approach that doesn't require a file intermediate. The imported `pdbsum_ligand_interactions_list_to_df` function is used within the computational environment of the notebook, and the dataframe produced is assigned to a variable in the running notebook. In the end, the results are in an active dataframe in the notebook without needing to read the pickled dataframe. **Although bear in mind the pickled dataframe still gets made, and it is good to download and keep that pickled dataframe since you'll find it convenient for reading and getting back into an analysis without needing to rerun earlier steps again.**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Provided interactions data read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'ligand_int_pickled_df.pkl'\n", "\n", "Returning a dataframe with the information as well." ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Atom1 no.Atom1 nameAtom1 Res nameAtom1 Res no.Atom1 ChainAtom2 no.Atom2 nameAtom2 Res nameAtom2 Res no.Atom2 ChainDistancetype
0670NLEU88A5701OE2THH3001A3.08Hydrogen bonds
1876OVAL115A5721N8THH3001A2.82Hydrogen bonds
21541OE1GLU204A5731NA2THH3001A2.45Hydrogen bonds
31744NH2ARG233A5728O4THH3001A3.02Hydrogen bonds
42801OHTYR371A5707O2THH3001A2.78Hydrogen bonds
\n", "
" ], "text/plain": [ " Atom1 no. Atom1 name Atom1 Res name Atom1 Res no. Atom1 Chain Atom2 no. \\\n", "0 670 N LEU 88 A 5701 \n", "1 876 O VAL 115 A 5721 \n", "2 1541 OE1 GLU 204 A 5731 \n", "3 1744 NH2 ARG 233 A 5728 \n", "4 2801 OH TYR 371 A 5707 \n", "\n", " Atom2 name Atom2 Res name Atom2 Res no. Atom2 Chain Distance \\\n", "0 OE2 THH 3001 A 3.08 \n", "1 N8 THH 3001 A 2.82 \n", "2 NA2 THH 3001 A 2.45 \n", "3 O4 THH 3001 A 3.02 \n", "4 O2 THH 3001 A 2.78 \n", "\n", " type \n", "0 Hydrogen bonds \n", "1 Hydrogen bonds \n", "2 Hydrogen bonds \n", "3 Hydrogen bonds \n", "4 Hydrogen bonds " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "direct_df = pdbsum_ligand_interactions_list_to_df(\"data.txt\")\n", "direct_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This may be how you prefer to use the script. Either option exists.\n", "\n", "----\n", "\n", "Continue on with other notebooks in the series if you wish.\n", "\n", "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }