{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "BioPandas\n", "\n", "Author: Sebastian Raschka \n", "License: BSD 3 clause \n", "Project Website: http://rasbt.github.io/biopandas/ \n", "Code Repository: https://github.com/rasbt/biopandas " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last updated: 2022-05-12\n", "\n", "pandas : 1.4.0\n", "biopandas: 0.4.0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -p pandas,biopandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from biopandas.pdb import PandasPdb\n", "import pandas as pd\n", "pd.set_option('display.width', 600)\n", "pd.set_option('display.max_columns', 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with PDB Structures in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading PDB Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several ways to load a PDB structure into a `PandasPdb` object.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1 -- Loading a PDB file from the Protein Data Bank\n", "\n", "PDB files can be fetched directly from the Protein Data Bank at [http://www.rcsb.org](http://www.rcsb.org) via their unique 4-character PDB code after initializing a new [`PandasPdb`](../api/biopandas.pdb#pandaspdb) object and calling the [`fetch_pdb`](../api/biopandas.pdb#pandaspdbfetch_pdb) method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from biopandas.pdb import PandasPdb\n", "\n", "# Initialize a new PandasPdb object\n", "# and fetch the PDB file from rcsb.org\n", "ppdb = PandasPdb().fetch_pdb('3eiy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2 -- Loading a PDB file from the AlphaFold Structure Database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"\n", "(*New in version 0.4.0*)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PDB files can be fetched directly from the AlphaFold Structure Database at [https://alphafold.ebi.ac.uk/](https://alphafold.ebi.ac.uk/) via their unique [UniProt](https://www.uniprot.org/) identifier after initializing a new [`PandasPdb`](../api/biopandas.pdb#pandaspdb) object and calling the [`fetch_pdb`](../api/biopandas.pdb#pandaspdbfetch_pdb) method with the `uniprot_id` and `source` arguments:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from biopandas.pdb import PandasPdb\n", "\n", "# Initialize a new PandasPdb object\n", "# and fetch the PDB file from alphafold.ebi.ac.uk\n", "ppdb = PandasPdb().fetch_pdb(uniprot_id='Q5VSL9', source=\"alphafold2-v2\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 111\n", "1 ATOM 2 CA ... C NaN 112\n", "2 ATOM 3 C ... C NaN 113\n", "3 ATOM 4 CB ... C NaN 114\n", "4 ATOM 5 O ... O NaN 115\n", "... ... ... ... ... ... ... ... ... ...\n", "6713 ATOM 6714 CG ... C NaN 6824\n", "6714 ATOM 6715 CD ... C NaN 6825\n", "6715 ATOM 6716 NE2 ... N NaN 6826\n", "6716 ATOM 6717 OE1 ... O NaN 6827\n", "6717 ATOM 6718 OXT ... O NaN 6828\n", "\n", "[6718 rows x 21 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df[\"ATOM\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 a) -- Loading a PDB structure from a local file\n", "\n", "\n", "Alternatively, we can load PDB files from local directories as regular PDB files using [`read_pdb`](../api/biopandas.pdb#pandaspdbread_pdb):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.read_pdb('./data/3eiy.pdb')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.pdb)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 b) -- Loading a PDB structure from a local gzipped PDB file\n", "\n", "Or, we can load them from gzip archives like so (note that the file must end with a '.gz' suffix in order to be recognized as a gzip file):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.read_pdb('./data/3eiy.pdb.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: 
[3eiy.pdb.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.pdb.gz?raw=true)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the file has been successfully loaded, we have access to the following attributes:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PDB Code: 3eiy\n", "PDB Header Line: HYDROLASE 17-SEP-08 3EIY\n", "\n", "Raw PDB file contents:\n", "\n", "HEADER HYDROLASE 17-SEP-08 3EIY \n", "TITLE CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHATASE FROM BURKHOLDERIA \n", "TITLE 2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE \n", "COMPND MOL_ID: 1; \n", "COMPND 2 MOLECULE: INORGANIC PYROPHOSPHATASE; \n", "COMPND 3 CHAIN: A; \n", "COMPND 4 EC: 3.6.1.1; \n", "COMPND 5 ENGINEERED: YES \n", "SOURCE MOL_ID: 1; \n", "SOURCE 2 ORGANISM_SCIENTIFIC: BURKHOLDERIA PSEUDOMALLEI 1710B; \n", "SOURCE 3 ORGANISM_TAXID: 320372; \n", "SOURCE 4 GENE: PPA, BURPS1710B_1237; \n", "SOURCE 5 EXPRESSION_SYSTEM\n", "...\n" ] } ], "source": [ "print('PDB Code: %s' % ppdb.code)\n", "print('PDB Header Line: %s' % ppdb.header)\n", "print('\\nRaw PDB file contents:\\n\\n%s\\n...' % ppdb.pdb_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most interesting / useful attribute is the [`PandasPdb.df`](../api/biopandas.pdb#pandaspdbdf) DataFrame dictionary though, which gives us access to the PDB files as pandas DataFrames. Let's print the first 3 lines from the `ATOM` coordinate section to see what it looks like:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 609\n", "1 ATOM 2 CA ... C NaN 610\n", "2 ATOM 3 C ... C NaN 611\n", "\n", "[3 rows x 21 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ATOM'].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But more on that in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 -- Loading a PDB file from a Python list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 609\n", "1 ATOM 2 CA ... C NaN 610\n", "2 ATOM 3 C ... C NaN 611\n", "3 ATOM 4 O ... O NaN 612\n", "4 ATOM 5 CB ... C NaN 613\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open('./data/3eiy.pdb', 'r') as f:\n", "    three_eiy = f.readlines()\n", "\n", "ppdb2 = PandasPdb()\n", "ppdb2.read_pdb_from_list(three_eiy)\n", "\n", "ppdb2.df['ATOM'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5 -- Obtaining a PDB file from a mmCIF structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since v0.5.0, it is also possible to obtain a `PandasPdb` object from an mmCIF file using `PandasMmcif.convert_to_pandas_pdb()`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name alt_loc residue_name blank_2 \\\n", "0 ATOM 1 N SER \n", "1 ATOM 2 CA SER \n", "2 ATOM 3 C SER \n", "3 ATOM 4 O SER \n", "4 ATOM 5 CB SER \n", "\n", " chain_id residue_number insertion ... x_coord y_coord z_coord \\\n", "0 A 2 ... 2.527 54.656 -1.667 \n", "1 A 2 ... 3.259 54.783 -0.368 \n", "2 A 2 ... 4.127 53.553 -0.105 \n", "3 A 2 ... 5.274 53.451 -0.594 \n", "4 A 2 ... 2.273 54.944 0.792 \n", "\n", " occupancy b_factor blank_4 segment_id element_symbol charge line_idx \n", "0 1.0 52.73 N NaN 0 \n", "1 1.0 52.54 C NaN 1 \n", "2 1.0 52.03 C NaN 2 \n", "3 1.0 52.45 O NaN 3 \n", "4 1.0 52.69 C NaN 4 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "\n", "mmcif = PandasMmcif().fetch_mmcif(\"3EIY\")\n", "pdb = mmcif.convert_to_pandas_pdb()\n", "\n", "print(\"Type:\", type(pdb))\n", "pdb.df[\"ATOM\"].head() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at PDBs in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PDB files are parsed according to the [PDB file format description](http://www.rcsb.org/pdb/static.do?p=file_formats/pdb/index.html). More specifically, BioPandas reads the columns of the ATOM and HETATM sections as shown in the following excerpt from [http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM](http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| COLUMNS | DATA TYPE | CONTENTS | biopandas column name |\n", "|---------|--------------|--------------------------------------------|-----------------------|\n", "| 1 - 6 | Record name | \"ATOM\" | record_name |\n", "| 7 - 11 | Integer | Atom serial number. | atom_number |\n", "| 12 | | | blank_1 |\n", "| 13 - 16 | Atom | Atom name. 
| atom_name |\n", "| 17 | Character | Alternate location indicator. | alt_loc |\n", "| 18 - 20 | Residue name | Residue name. | residue_name |\n", "| 21 | | | blank_2 |\n", "| 22 | Character | Chain identifier. | chain_id |\n", "| 23 - 26 | Integer | Residue sequence number. | residue_number |\n", "| 27 | AChar | Code for insertion of residues. | insertion |\n", "| 28 - 30 | | | blank_3 |\n", "| 31 - 38 | Real(8.3) | Orthogonal coordinates for X in Angstroms. | x_coord |\n", "| 39 - 46 | Real(8.3) | Orthogonal coordinates for Y in Angstroms. | y_coord |\n", "| 47 - 54 | Real(8.3) | Orthogonal coordinates for Z in Angstroms. | z_coord |\n", "| 55 - 60 | Real(6.2) | Occupancy. | occupancy |\n", "| 61 - 66 | Real(6.2) | Temperature factor (Default = 0.0). | b_factor |\n", "| 67 - 72 | | | blank_4 |\n", "| 73 - 76 | LString(4) | Segment identifier, left-justified. | segment_id |\n", "| 77 - 78 | LString(2) | Element symbol, right-justified. | element_symbol |\n", "| 79 - 80 | LString(2) | Charge on the atom. 
| charge |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example of what this looks like in an actual PDB file:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "    Example:\n", "             1         2         3         4         5         6         7         8\n", "    12345678901234567890123456789012345678901234567890123456789012345678901234567890\n", "    ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N\n", "    ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C\n", "    ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C\n", "    ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O\n", "    ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C\n", "    ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C\n", "    ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C\n", "    ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C\n", "    ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C\n", "    ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After loading a PDB file from rcsb.org or our local drive, the [`PandasPdb.df`](../api/biopandas.pdb/#pandaspdbdf) attribute should contain the following 4 DataFrame objects:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb()\n", "ppdb.read_pdb('./data/3eiy.pdb')\n", "ppdb.df.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.pdb)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 'ATOM': contains the entries from the ATOM coordinate section\n", "- 'HETATM': ... entries from the \"HETATM\" coordinate section \n", "- 'ANISOU': ... 
entries from the \"ANISOU\" coordinate section \n", "- 'OTHERS': Everything else that is *not* an 'ATOM', 'HETATM', or 'ANISOU' entry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](./img/df_dict.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns of the 'HETATM' DataFrame are identical to those of the 'ATOM' DataFrame that we've seen earlier:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 HETATM 1332 K ... K NaN 1940\n", "1 HETATM 1333 NA ... NA NaN 1941\n", "\n", "[2 rows x 21 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['HETATM'].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that \"ANISOU\" entries are handled a bit differently as specified at [http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM](http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [record_name, atom_number, blank_1, atom_name, alt_loc, residue_name, blank_2, chain_id, residue_number, insertion, blank_3, U(1,1), U(2,2), U(3,3), U(1,2), U(1,3), U(2,3), blank_4, element_symbol, charge, line_idx]\n", "Index: []\n", "\n", "[0 rows x 21 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ANISOU'].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not every PDB file contains ANISOU entries (similarly, some PDB files may only contain HETATM or ATOM entries). If records are absent, the DataFrame will be empty, as shown above." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ANISOU'].empty" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the DataFrames are fairly wide, let's take a look at the columns by accessing the DataFrame's `columns` attribute:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['record_name', 'atom_number', 'blank_1', 'atom_name', 'alt_loc', 'residue_name', 'blank_2', 'chain_id', 'residue_number', 'insertion', 'blank_3', 'U(1,1)', 'U(2,2)', 'U(3,3)', 'U(1,2)', 'U(1,3)', 'U(2,3)', 'blank_4', 'element_symbol', 'charge', 'line_idx'], dtype='object')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ANISOU'].columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANISOU records are very similar to ATOM/HETATM records. 
In fact, the columns 7 - 27 and 73 - 80 are identical to their corresponding ATOM/HETATM records, which means that the 'ANISOU' DataFrame doesn't have the following entries:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'b_factor', 'occupancy', 'segment_id', 'x_coord', 'y_coord', 'z_coord'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(ppdb.df['ATOM'].columns).difference(set(ppdb.df['ANISOU'].columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, the \"ANISOU\" DataFrame contains the anisotropic temperature factors \"U(-,-)\" -- note that these are scaled by a factor of $10^4$ ($\\text{Angstroms}^2$) by convention." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'U(1,1)', 'U(1,2)', 'U(1,3)', 'U(2,2)', 'U(2,3)', 'U(3,3)'}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(ppdb.df['ANISOU'].columns).difference(set(ppdb.df['ATOM'].columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah, another interesting thing to mention is that the columns already come with the types you'd expect (where `object` essentially \"means\" `str` here):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "record_name object\n", "atom_number int64\n", "blank_1 object\n", "atom_name object\n", "alt_loc object\n", "residue_name object\n", "blank_2 object\n", "chain_id object\n", "residue_number int64\n", "insertion object\n", "blank_3 object\n", "x_coord float64\n", "y_coord float64\n", "z_coord float64\n", "occupancy float64\n", "b_factor float64\n", "blank_4 object\n", "segment_id object\n", "element_symbol object\n", "charge float64\n", "line_idx int64\n", "dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ATOM'].dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically, all good things come in threes; however, there is a 4th DataFrame, the 'OTHERS' DataFrame, which contains everything that wasn't parsed as an 'ATOM', 'HETATM', or 'ANISOU' entry:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name entry line_idx\n", "0 HEADER HYDROLASE 17... 0\n", "1 TITLE CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHA... 1\n", "2 TITLE 2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE 2\n", "3 COMPND MOL_ID: 1; 3\n", "4 COMPND 2 MOLECULE: INORGANIC PYROPHOSPHATASE; 4" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['OTHERS'].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although these 'OTHERS' entries are typically less useful for structure-related computations, you may still want to take a look at them to get a short summary of the PDB structure and learn about its potential quirks and gotchas (typically listed in the REMARK section). Lastly, the \"OTHERS\" DataFrame comes in handy if we want to reconstruct the structure as a PDB file, as we will see later (note the `line_idx` columns in all of the DataFrames)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with PDB DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous sections, we've seen how to load PDB structures into DataFrames and how to access them. Now, let's talk about manipulating PDB files in DataFrames." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 609\n", "1 ATOM 2 CA ... C NaN 610\n", "2 ATOM 3 C ... C NaN 611\n", "3 ATOM 4 O ... O NaN 612\n", "4 ATOM 5 CB ... C NaN 613\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb()\n", "ppdb.read_pdb('./data/3eiy.pdb.gz')\n", "ppdb.df['ATOM'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.pdb.gz?raw=true)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, there's actually not *that* much to say ... \n", "Once we have our PDB file in the DataFrame format, we have the whole convenience of [pandas](http://pandas.pydata.org) right there at our fingertips." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, let's get all Proline residues:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "38 ATOM 39 N ... N NaN 647\n", "39 ATOM 40 CA ... C NaN 648\n", "40 ATOM 41 C ... C NaN 649\n", "41 ATOM 42 O ... O NaN 650\n", "42 ATOM 43 CB ... C NaN 651\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ATOM'][ppdb.df['ATOM']['residue_name'] == 'PRO'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the main chain carbonyl carbon atoms (atom name 'C'):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "2 ATOM 3 C ... C NaN 611\n", "8 ATOM 9 C ... C NaN 617\n", "19 ATOM 20 C ... C NaN 628\n", "25 ATOM 26 C ... C NaN 634\n", "33 ATOM 34 C ... C NaN 642\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ATOM'][ppdb.df['ATOM']['atom_name'] == 'C'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also easy to strip hydrogen atoms from our coordinate section, if there are any:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
record_nameatom_numberblank_1atom_name...segment_idelement_symbolchargeline_idx
0ATOM1N...NNaN609
1ATOM2CA...CNaN610
2ATOM3C...CNaN611
3ATOM4O...ONaN612
4ATOM5CB...CNaN613
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 609\n", "1 ATOM 2 CA ... C NaN 610\n", "2 ATOM 3 C ... C NaN 611\n", "3 ATOM 4 O ... O NaN 612\n", "4 ATOM 5 CB ... C NaN 613\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, let's compute the average temperature factor of our protein main chain:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average B-Factor [Main Chain]: 28.83\n" ] } ], "source": [ "mainchain = ppdb.df['ATOM'][(ppdb.df['ATOM']['atom_name'] == 'C') | \n", " (ppdb.df['ATOM']['atom_name'] == 'O') | \n", " (ppdb.df['ATOM']['atom_name'] == 'N') | \n", " (ppdb.df['ATOM']['atom_name'] == 'CA')]\n", "\n", "bfact_mc_avg = mainchain['b_factor'].mean()\n", "print('Average B-Factor [Main Chain]: %.2f' % bfact_mc_avg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Loading PDB files from a Python List**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 ATOM 1 N ... N NaN 609\n", "1 ATOM 2 CA ... C NaN 610\n", "2 ATOM 3 C ... C NaN 611\n", "3 ATOM 4 O ... O NaN 612\n", "4 ATOM 5 CB ... C NaN 613\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open('./data/3eiy.pdb', 'r') as f:\n", " three_eiy = f.readlines()\n", "\n", "ppdb2 = PandasPdb()\n", "ppdb2.read_pdb_from_list(three_eiy)\n", "\n", "ppdb2.df['ATOM'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with PDBs Containing Multiple Models\n", "\n", "(*New in version 0.4.0*)\n", "\n", "Some PDB files, particularly those containing NMR structures, provide an ensemble of models. There are various ways to extract these.\n", "\n", "In these examples we will work with [2JYF](https://www.rcsb.org/structure/2JYF): an RNA structure containing 10 models of the same underlying RNA structure.\n", "\n", "To start, we con obtain a DataFrame denoting the lines of the PDB files corresponding to each model." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " idxs[\"end_idx\"] = ends.line_idx.values\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " record_name model_idx start_idx end_idx\n", "129 MODEL 1 129 2896\n", "133 MODEL 2 2897 5664\n", "137 MODEL 3 5665 8432\n", "141 MODEL 4 8433 11200\n", "145 MODEL 5 11201 13968\n", "149 MODEL 6 13969 16736\n", "153 MODEL 7 16737 19504\n", "157 MODEL 8 19505 22272\n", "161 MODEL 9 22273 25040\n", "165 MODEL 10 25041 27808" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "\n", "ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')\n", "ppdb.get_model_start_end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Assigning model IDs to the PDB DataFrames**\n", "\n", "For ease of use, the `label_models()` method adds an additional column, `\"model_id\"` to the dataframes contained within the `PandasPdb` object." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " idxs[\"end_idx\"] = ends.line_idx.values\n" ] }, { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", " ..\n", "27635 10\n", "27636 10\n", "27637 10\n", "27638 10\n", "27639 10\n", "Name: model_id, Length: 27640, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')\n", "\n", "ppdb.label_models()\n", "ppdb.df[\"ATOM\"][\"model_id\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subsetting `PandasPdb` objects to a given model**\n", "\n", "We can obtain new 
`PandasPdb` objects containing only a given model using the `get_model()` method:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " idxs[\"end_idx\"] = ends.line_idx.values\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... element_symbol charge line_idx model_id\n", "8292 ATOM 1 O5' ... O NaN 8434 4\n", "8293 ATOM 2 C5' ... C NaN 8435 4\n", "8294 ATOM 3 C4' ... C NaN 8436 4\n", "8295 ATOM 4 O4' ... O NaN 8437 4\n", "8296 ATOM 5 C3' ... C NaN 8438 4\n", "... ... ... ... ... ... ... ... ... ...\n", "11051 ATOM 2761 HO2' ... H NaN 11194 4\n", "11052 ATOM 2762 H1' ... H NaN 11195 4\n", "11053 ATOM 2763 H3 ... H NaN 11196 4\n", "11054 ATOM 2764 H5 ... H NaN 11197 4\n", "11055 ATOM 2765 H6 ... H NaN 11198 4\n", "\n", "[2764 rows x 22 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')\n", "\n", "model_4 = ppdb.get_model(model_index=4)\n", "model_4.df[\"ATOM\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subsetting `PandasPdb` objects to a list of given models**\n", "\n", "We can obtain new `PandasPdb` objects containing only a given models using the `get_models()` method" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " idxs[\"end_idx\"] = ends.line_idx.values\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... element_symbol charge line_idx model_id\n", "2764 ATOM 1 O5' ... O NaN 2898 2\n", "2765 ATOM 2 C5' ... C NaN 2899 2\n", "2766 ATOM 3 C4' ... C NaN 2900 2\n", "2767 ATOM 4 O4' ... O NaN 2901 2\n", "2768 ATOM 5 C3' ... C NaN 2902 2\n", "... ... ... ... ... ... ... ... ... ...\n", "22107 ATOM 2761 HO2' ... H NaN 22266 8\n", "22108 ATOM 2762 H1' ... H NaN 22267 8\n", "22109 ATOM 2763 H3 ... H NaN 22268 8\n", "22110 ATOM 2764 H5 ... H NaN 22269 8\n", "22111 ATOM 2765 H6 ... H NaN 22270 8\n", "\n", "[11056 rows x 22 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')\n", "\n", "model_ensemble = ppdb.get_models(model_indices=[2, 4, 6, 8])\n", "model_ensemble.df[\"ATOM\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are using pandas under the hood, which in turns uses matplotlib under the hood, we can produce quick summary plots of our PDB structures relatively conveniently:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.pdb.gz?raw=true)]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "from matplotlib import style\n", "style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ppdb.df['ATOM']['b_factor'].plot(kind='hist')\n", "plt.title('Distribution of B-Factors')\n", "plt.xlabel('B-factor')\n", "plt.ylabel('count')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ppdb.df['ATOM']['b_factor'].plot(kind='line')\n", "plt.title('B-Factors Along the Amino Acid Chain')\n", "plt.xlabel('Residue Number')\n", "plt.ylabel('B-factor in $A^2$')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ppdb.df['ATOM']['element_symbol'].value_counts().plot(kind='bar')\n", "plt.title('Distribution of Atom Types')\n", "plt.xlabel('elements')\n", "plt.ylabel('count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing the Root Mean Square Deviation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BioPandas also comes with certain convenience functions, for example, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Root-mean-square deviation (RMSD) is simply a measure of the average distance between atoms of 2 protein or ligand structures. This calculation of the Cartesian error follows the equation:\n", "\n", "$$\n", "RMSD(a, b) = \\sqrt{\\frac{1}{n} \\sum^{n}_{i=1} \\big((a_{ix})^2 + (a_{iy})^2 + (a_{iz})^2 \\big)}\n", "= \\sqrt{\\frac{1}{n} \\sum^{n}_{i=1} || a_i + b_i||_2^2}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, assuming that the we have the following 2 conformations of a ligand molecule\n", "\n", "![](./img/ligand_rmsd.png)\n", "\n", "we can compute the RMSD as follows:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 2.6444 Angstrom\n" ] } ], "source": [ "from biopandas.pdb import PandasPdb\n", "\n", "l_1 = PandasPdb().read_pdb('./data/lig_conf_1.pdb')\n", "l_2 = PandasPdb().read_pdb('./data/lig_conf_2.pdb')\n", "r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'],\n", " s=None) # all atoms, including hydrogens\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "0 HETATM 1 C1 ... C NaN 0\n", "1 HETATM 2 O1 ... O NaN 1\n", "2 HETATM 3 C2 ... C NaN 2\n", "3 HETATM 4 O2 ... O NaN 3\n", "4 HETATM 5 C3 ... C NaN 4\n", "5 HETATM 6 O3 ... O NaN 5\n", "6 HETATM 7 C4 ... C NaN 6\n", "7 HETATM 8 O4 ... O NaN 7\n", "8 HETATM 9 C5 ... C NaN 8\n", "9 HETATM 10 O5 ... O NaN 9\n", "10 HETATM 11 C6 ... C NaN 10\n", "11 HETATM 12 O6 ... O NaN 11\n", "12 HETATM 13 C7 ... C NaN 12\n", "13 HETATM 14 C8 ... C NaN 13\n", "14 HETATM 15 C9 ... C NaN 14\n", "15 HETATM 16 C10 ... C NaN 15\n", "16 HETATM 17 H1 ... H NaN 16\n", "17 HETATM 18 H2 ... H NaN 17\n", "18 HETATM 19 H3 ... H NaN 18\n", "19 HETATM 20 H4 ... H NaN 19\n", "20 HETATM 21 H5 ... H NaN 20\n", "21 HETATM 22 H6 ... H NaN 21\n", "22 HETATM 23 H7 ... H NaN 22\n", "23 HETATM 24 H8 ... H NaN 23\n", "\n", "[24 rows x 21 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l_1.df['HETATM']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File links: [lig_conf_1.pdb](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/lig_conf_1.pdb), [lig_conf_2.pdb](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/lig_conf_2.pdb)]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 1.7249 Angstrom\n" ] } ], "source": [ "r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], \n", " s='carbon') # carbon atoms only\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 1.9959 Angstrom\n" ] } ], "source": [ "r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], \n", " s='heavy') # heavy atoms only\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can compute the RMSD between 2 related protein structures:\n", "\n", "![](./img/1t48_rmsd.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hydrogen-free RMSD:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 0.7377 Angstrom\n" ] } ], "source": [ "p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')\n", "p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')\n", "r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='heavy')\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the RMSD between the main chains only:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 0.4781 Angstrom\n" ] } ], "source": [ "p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')\n", "p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')\n", "r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='main chain')\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering PDBs by Distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `distance` method to compute the distance between each atom (or a subset of atoms) in our data frame and a three-dimensional reference point. For example:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "p_1 = PandasPdb().read_pdb('./data/3eiy.pdb')\n", "\n", "reference_point = (9.362, 41.410, 10.542)\n", "distances = p_1.distance(xyz=reference_point, records=('ATOM',))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.pdb)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distance method returns a Pandas Series object:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 19.267419\n", "1 18.306060\n", "2 16.976934\n", "3 16.902897\n", "4 18.124171\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distances.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can use this `Series` object, for instance, to select certain atoms in our DataFrame that fall within a desired distance threshold. For example, let's select all atoms that are within 7A of our reference point: " ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx\n", "786 ATOM 787 CB ... C NaN 1395\n", "787 ATOM 788 CG ... C NaN 1396\n", "788 ATOM 789 CD1 ... C NaN 1397\n", "789 ATOM 790 CD2 ... C NaN 1398\n", "790 ATOM 791 N ... N NaN 1399\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_within_7A = p_1.df['ATOM'][distances < 7.0]\n", "all_within_7A.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualized in PyMOL, this subset (yellow surface) would look as follows:\n", " \n", "![](./img/3eiy_7a.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting Amino Acid codes from 3- to 1-letter codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Residues in the `residue_name` field can be converted into 1-letter amino acid codes, which may be useful for further sequence analysis, for example, pair-wise or multiple sequence alignments:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " chain_id residue_name\n", "1378 B I\n", "1386 B N\n", "1394 B Y\n", "1406 B R\n", "1417 B T" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().fetch_pdb('5mtn')\n", "sequence = ppdb.amino3to1()\n", "sequence.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, the `amino3to1` method returns a `DataFrame` containing the `chain_id` and `residue_name` of the translated 1-letter amino acids. If you like to work with the sequence as a Python list of string characters, you could do the following:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['V', 'R', 'H', 'Y', 'T']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequence_list = list(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])\n", "sequence_list[-5:] # last 5 residues of chain A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if you prefer to work with the sequence as a string, you can use the `join` method: " ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To iterate over the sequences of multi-chain proteins, you can use the `unique` method as shown below:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chain ID: A\n", "SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT\n", "\n", "Chain ID: B\n", 
"SVSSVPTKLEVVAATPTSLLISWDAPAVTVVYYLITYGETGSPWPGGQAFEVPGSKSTATISGLKPGVDYTITVYAHRSSYGYSENPISINYRT\n" ] } ], "source": [ "for chain_id in sequence['chain_id'].unique():\n", " print('\\nChain ID: %s' % chain_id)\n", " print(''.join(sequence.loc[sequence['chain_id'] == chain_id, 'residue_name']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping it up - Saving PDB structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's talk about how to get the PDB structures out of the DataFrame format back into the beloved .pdb format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we loaded a PDB structure, removed it from its hydrogens:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from biopandas.pdb import PandasPdb\n", "ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')\n", "ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.pdb.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.pdb.gz?raw=true)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the file using the [`PandasPdb.to_pdb`](../api/biopandas.pdb#pandaspdbto_pdb) method:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "ppdb.to_pdb(path='./data/3eiy_stripped.pdb', \n", " records=None, \n", " gz=False, \n", " append_newline=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy_stripped.pdb](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy_stripped.pdb)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, all records (that is, 'ATOM', 'HETATM', 'OTHERS', 'ANISOU') are written if we set `records=None`. 
Alternatively, let's say we want to get rid of the 'ANISOU' entries and produce a compressed gzip archive of our PDB structure:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "ppdb.to_pdb(path='./data/3eiy_stripped.pdb.gz', \n", " records=['ATOM', 'HETATM', 'OTHERS'], \n", " gz=True, \n", " append_newline=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy_stripped.pdb.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy_stripped.pdb.gz?raw=true)]" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }