{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Demonstration of PDBsum info grab script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This largely builds on the notebook [Demonstration of PDBsum ligand interface data to dataframe script](Demonstration%20of%20PDBsum%20ligand%20interface%20data%20to%20dataframe%20script.ipynb) and extends it to get some more general information about the experimentally-determined structure besides the ligand to help guide an answer to Biostars post [How to get specific information from a list of PDB ID and save all in a spreadsheet?](https://www.biostars.org/p/9515231/#9515231). As dicussed in there, because this isn't derived data, there's no reason in particular to use PDBsum for extracting this information. PDBfinder database may be better; however, for getting derived data, I already had the pipeline working with PDBsum and it was just a matter of adapting a couple scripts to get some details associated with the structures that are also listed at other places such as the Protein Data Bank, [the OCA Database and Browse](http://oca.weizmann.ac.il/oca-bin/ocamain), or Proteopedia.\n", "\n", "----\n", "\n", "## preparation\n", "\n", "Let's make a file listing PDB accesion codes each on a separate line like we might get sent by someone looking for information on many PDB entries. (This is mostly so this notebook is self-contained.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing 'pdb_ids_each_online' (str) to file 'pdb_ids.txt'.\n" ] } ], "source": [ "pdb_ids_each_online='''1wsv\n", "1eve\n", "1btn\n", "1trn'''\n", "%store pdb_ids_each_online >pdb_ids.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the latest versions of the related scripts." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 14198 100 14198 0 0 70287 0 --:--:-- --:--:-- --:--:-- 70287\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 13917 100 13917 0 0 69934 0 --:--:-- --:--:-- --:--:-- 69934\n" ] } ], "source": [ "# Get a file if not yet retrieved / check if file exists\n", "import os\n", "file_needed = \"pdbsum_stats_and_info_adpated_example.py\"\n", "if not os.path.isfile(file_needed):\n", " !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}\n", "file_needed = \"pdb_ids_to_stats_and_info_df.py\"\n", "if not os.path.isfile(file_needed):\n", " !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demonstration of getting information on a single PDB entry\n", "\n", "Say interested in example PDBsum page from [here](http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1eve&template=main.html) on the 'Top page' tab and we'd like to mine out some details. \n", "\n", "Quick example of use of `pdbsum_stats_and_info_adpated_example.py` to get information on a single PDB entry.\n", "\n", "First, along the lines of how you'd use it on the command line. (%run is special to notebooks; you'd replace `%run` in the command above with `python` or appropriate call for your python on your own command line.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Details for specified structure read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'statsninfo_pickled_df.pkl'" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01eve2.50 Å0.188NAG-NAG, NAG, E20
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1eve 2.50 Å 0.188 NAG-NAG, NAG, E20" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%run pdbsum_stats_and_info_adpated_example.py 1eve\n", "# read in the dataframe produced\n", "import pandas as pd\n", "df_made = pd.read_pickle(\"statsninfo_pickled_df.pkl\")\n", "df_made" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(%run is special to notebooks; you'd replace `%run` in the command above with `python` or appropriate call for your python on your own command line.)\n", " \n", "You could also use it inside a Jupyter notebook by importin the main function and then using a function call supplying the PDB code identifier of interest as an argument." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Details for specified structure read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'statsninfo_pickled_df.pkl'\n", "\n", "Returning a dataframe with the information as well." ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01trn2.20 Å0.177ISP
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1trn 2.20 Å 0.177 ISP" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pdbsum_stats_and_info_adpated_example import pdbsum_stats_and_info_adpated_example\n", "df_func = pdbsum_stats_and_info_adpated_example(\"1trn\")\n", "df_func" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that retrurned a dataframe directly so we didn't necessarily have to read it back in to memory like for the command line version. Also note that it did indeed save a file so that we could take that and later use it elsewhere and bring the dataframe back into memory without generating it first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demonstration of getting information on several PDB entries provided as a list\n", "\n", "This section will demonstrate getting information on several PDB entries provided as a file with each identifier code listed on a separate line.\n", "\n", "A file listing the PDB id codes was already saved during the preparation. We can show it here." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1wsv\r\n", "1eve\r\n", "1btn\r\n", "1trn\r\n" ] } ], "source": [ "!cat pdb_ids.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can point the script that will take that list and run it as many times as necessary to collect the information for each PBB code listed.\n", "\n", "As above, first we demonstrate this similar to how it would be used on the command line. (%run is special to notebooks; you'd replace `%run` in the command above with `python` or appropriate call for your python on your own command line.)\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Parsing details from PDBsum ...\n", "Information for specified structures read and converted to a single dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'stats_and_info_for_PDBids_pickled_df.pkl'" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01wsv2.60 Å0.171SO4, THH
11eve2.50 Å0.188NAG-NAG, NAG, E20
21btn2.00 Å0.205I3P
31trn2.20 Å0.177ISP
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1wsv 2.60 Å 0.171 SO4, THH\n", "1 1eve 2.50 Å 0.188 NAG-NAG, NAG, E20\n", "2 1btn 2.00 Å 0.205 I3P\n", "3 1trn 2.20 Å 0.177 ISP" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%run pdb_ids_to_stats_and_info_df.py pdb_ids.txt\n", "# read in the dataframe produced\n", "import pandas as pd\n", "df_ids = pd.read_pickle(\"stats_and_info_for_PDBids_pickled_df.pkl\")\n", "df_ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(%run is special to notebooks; you'd replace `%run` in the command above with `python` or appropriate call for your python on your own command line.)\n", "\n", "It is probably wise to make sure it works with the demonstration data first and then replace the PDB codes in `pdb_ids.txt` or upload your own file listing PDB code ids and point the script at it with one of the options demonstrated here.\n", "\n", "You could do that in a notebook by importing the main function of the script and pointing it at the file listing the PDB codes." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Parsing details from PDBsum ...\n", "Information for specified structures read and converted to a single dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'stats_and_info_for_PDBids_pickled_df.pkl'\n", "Returning a dataframe with the information as well." ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01wsv2.60 Å0.171SO4, THH
11eve2.50 Å0.188NAG-NAG, NAG, E20
21btn2.00 Å0.205I3P
31trn2.20 Å0.177ISP
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1wsv 2.60 Å 0.171 SO4, THH\n", "1 1eve 2.50 Å 0.188 NAG-NAG, NAG, E20\n", "2 1btn 2.00 Å 0.205 I3P\n", "3 1trn 2.20 Å 0.177 ISP" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pdb_ids_to_stats_and_info_df import pdb_ids_to_stats_and_info_df\n", "df = pdb_ids_to_stats_and_info_df(\"pdb_ids.txt\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is probably wise to make sure it works with the demonstration data first and then replace the PDB codes in `pdb_ids.txt` or upload your own file listing PDB code ids and point the script at it with one of the options demonstrated here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may want to get a sense of what else you can do with that dataframe by examining the previous notebook in this series, [Demonstration of PDBsum ligand interface data to dataframe script](Demonstration%20of%20PDBsum%20ligand%20interface%20data%20to%20dataframe%20script.ipynb).\n", "\n", "Next I'll demonstrate how to save it as text for use elsewhere, such as in Excel. Or even make a spreadsheet file in the Excel format directly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output to more universal, table-like formats\n", "\n", "If you do look at the previous notebooks in this series, you'll see that I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text based tables or even Excel. In fact, a text-based based table perhaps tab or comma-delimited would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be inportant if you are going to different versions of Python/Pandas as I briefly mentioned parenthetically above.\n", "\n", "**First, generating a text-based table.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file\n", "df.to_csv('pdbsum_info.tsv', sep='\\t',index = False) #add `,header=False` to leave off header, too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `df.to_csv()` defaults to dealing with csv, you can simply use `df.to_csv('example.csv',index = False)` for comma-delimited (comma-separated) files.\n", "\n", "You can see that worked by looking at the text file made with the next command. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PDB id\tResolution\tR value\tLigands\r\n", "1wsv\t2.60 Å\t0.171\tSO4, THH\r\n", "1eve\t2.50 Å\t0.188\tNAG-NAG, NAG, E20\r\n", "1btn\t2.00 Å\t0.205\tI3P\r\n", "1trn\t2.20 Å\t0.177\tISP\r\n" ] } ], "source": [ "!cat pdbsum_info.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you had need to go back from a tab-separated table to a dataframe, you can run something like in the following cell." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "reverted_df = pd.read_csv('pdbsum_info.tsv', sep='\\t')\n", "reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a comma-delimited (CSV) file you'd use `df = pd.read_csv('example.csv')` because `pd.read_csv()` method defaults to comma as the separator (`sep` parameter).\n", "\n", "You can verify that read from the text-based table by viewing it with the next line." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01wsv2.60 Å0.171SO4, THH
11eve2.50 Å0.188NAG-NAG, NAG, E20
21btn2.00 Å0.205I3P
31trn2.20 Å0.177ISP
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1wsv 2.60 Å 0.171 SO4, THH\n", "1 1eve 2.50 Å 0.188 NAG-NAG, NAG, E20\n", "2 1btn 2.00 Å 0.205 I3P\n", "3 1trn 2.20 Å 0.177 ISP" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reverted_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Generating an Excel spreadsheet from a dataframe.**\n", "\n", "Because this is a specialized need, there is a special module needed that I didn't bother installing by default and so it needs to be installed before generating the Excel file. Running the next cell will do both." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)\n", "Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install openpyxl\n", "# save to excel (KEEPS multiINDEX, and makes sparse to look good in Excel straight out of Python)\n", "df.to_excel('pdbsum_data.xlsx',index = False,) # after openpyxl installed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll need to download the file first to your computer and then view it locally as there is no viewer in the Jupyter environment.\n", "\n", "Adiitionally, it is possible to add styles to dataframes and the styles such as shading of cells and coloring of text will be translated to the Excel document made as well. That is covered elsewhere in resources referenced by other notebooks in this series.\n", "\n", "Excel files can be read in to Pandas dataframes directly without needing to go to a text based intermediate first." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# read Excel\n", "df_from_excel = pd.read_excel('pdbsum_data.xlsx',engine='openpyxl') # see https://stackoverflow.com/a/65266270/8508004 where notes xlrd no longer supports xlsx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That can be viewed to convince yourself it worked by running the next command." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PDB idResolutionR valueLigands
01wsv2.60 Å0.171SO4, THH
11eve2.50 Å0.188NAG-NAG, NAG, E20
21btn2.00 Å0.205I3P
31trn2.20 Å0.177ISP
\n", "
" ], "text/plain": [ " PDB id Resolution R value Ligands\n", "0 1wsv 2.60 Å 0.171 SO4, THH\n", "1 1eve 2.50 Å 0.188 NAG-NAG, NAG, E20\n", "2 1btn 2.00 Å 0.205 I3P\n", "3 1trn 2.20 Å 0.177 ISP" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_excel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "----\n", "\n", "Continue on with other notebooks in the series, listed [here](../index.ipynb) if you wish.\n", "\n", "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 4 }