{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "BioPandas\n", "\n", "Authors: \n", "- Sebastian Raschka \n", "- Arian Jamasb \n", "\n", "License: BSD 3 clause \n", "Project Website: http://rasbt.github.io/biopandas/ \n", "Code Repository: https://github.com/rasbt/biopandas " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last updated: 2022-05-12\n", "\n", "pandas : 1.4.0\n", "biopandas: 0.4.0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -p pandas,biopandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "\n", "pd.set_option('display.width', 600)\n", "pd.set_option('display.max_columns', 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with mmCIF Structures in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading mmCIF Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several ways to load an mmCIF structure into a `PandasMmcif` object.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1 -- Loading an mmCIF file from the Protein Data Bank\n", "\n", "mmCIF files can be fetched directly from the Protein Data Bank at [http://www.rcsb.org](http://www.rcsb.org) via their unique 4-letter code after initializing a new [`PandasMmcif`](../api_subpackages/biopandas.mmcif) object and calling the `fetch_mmcif` method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "# Initialize a new PandasMmcif object\n", "# and fetch the mmCIF file from rcsb.org\n", "pmmcif = PandasMmcif().fetch_mmcif('3eiy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2 -- Loading an mmCIF file from the AlphaFold Structure Database\n", "\n", "\n", "(*New in version 0.4.0*)\n", "\n", "mmCIF files can be fetched directly from the AlphaFold Structure Database at [https://alphafold.ebi.ac.uk/](https://alphafold.ebi.ac.uk/) via their unique [UniProt](https://www.uniprot.org/) identifier after initializing a new [`PandasMmcif`](../api_subpackages/biopandas.mmcif) object and calling the `fetch_mmcif` method with the `uniprot_id` and `source` arguments:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "# Initialize a new PandasMmcif object\n", "# and fetch the mmCIF file from alphafold.ebi.ac.uk\n", "ppdb = PandasMmcif().fetch_mmcif(uniprot_id='Q5VSL9', source='alphafold2-v2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 a) -- Loading an mmCIF structure from a local file\n", "\n", "Alternatively, we can load mmCIF files from local directories as regular mmCIF files using `read_mmcif`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.read_mmcif('./data/3eiy.cif')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.cif](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.cif)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 b) -- Loading an mmCIF structure from a local gzipped mmCIF file\n", "\n", "Or, we can load them from gzip archives like so (note that the file name must end with a '.gz' suffix in order to be recognized as a gzip file):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.read_mmcif('./data/3eiy.cif.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: 
[3eiy.cif.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.cif.gz?raw=true)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the file has been successfully loaded, we have access to the following attributes:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mmCIF Code: 3eiy\n", "mmCIF Header Line: \n", "\n", "Raw mmCIF file contents:\n", "\n", "data_3EIY\n", "# \n", "_entry.id 3EIY \n", "# \n", "_audit_conform.dict_name mmcif_pdbx.dic \n", "_audit_conform.dict_version 5.281 \n", "_audit_conform.dict_location http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic \n", "# \n", "loop_\n", "_database_2.database_id \n", "_database_2.database_code \n", "PDB 3EIY \n", "RCSB RCSB049380 \n", "WWPDB D_1000049380 \n", "# \n", "loop_\n", "_pdbx_database_related.db_name \n", "_pdbx_database_related.db_id \n", "_pdbx_database_related.details \n", "_pdbx_database_related.content_type \n", "TargetDB BupsA.00023.a . unspecified \n", "PDB 3d63 \n", ";The same protein, \"open\" conformation, apo form, in space group P21212\n", ";\n", "unspecified \n", "PDB 3EIZ . unspecified \n", "PDB 3EJ0 . unspecified \n", "PDB 3EJ2 . \n", "...\n" ] } ], "source": [ "print('mmCIF Code: %s' % pmmcif.code)\n", "print('mmCIF Header Line: %s' % pmmcif.header)\n", "print('\\nRaw mmCIF file contents:\\n\\n%s\\n...' % pmmcif.pdb_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most useful attribute, though, is the `PandasMmcif.df` DataFrame dictionary, which gives us access to the contents of the mmCIF file as pandas DataFrames. Let's print the first 3 rows of the `ATOM` coordinate section to see what it looks like:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "0 ATOM 1 N N ... SER A N 1\n", "1 ATOM 2 C CA ... SER A CA 1\n", "2 ATOM 3 C C ... SER A C 1\n", "\n", "[3 rows x 21 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['ATOM'].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But more on that in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 -- Loading an mmCIF file from a Python List" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "mmCIF files can also be loaded into a `PandasMmcif` object from a Python list:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "0 ATOM 1 N N ... SER A N 1\n", "1 ATOM 2 C CA ... SER A CA 1\n", "2 ATOM 3 C C ... SER A C 1\n", "3 ATOM 4 O O ... SER A O 1\n", "4 ATOM 5 C CB ... SER A CB 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the file into a list of lines\n", "with open('./data/3eiy.cif', 'r') as f:\n", "    three_eiy = f.readlines()\n", "\n", "pmmcif2 = PandasMmcif()\n", "pmmcif2.read_mmcif_from_list(three_eiy)\n", "\n", "pmmcif2.df['ATOM'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at mmCIF files in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "mmCIF files are parsed according to the [mmCIF file format description](https://mmcif.wwpdb.org). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information, we recommend the helpful [Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/beginner’s-guide-to-pdb-structures-and-the-pdbx-mmcif-format)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After loading an mmCIF file from rcsb.org or our local drive, the [`PandasMmcif.df`](../api_subpackages/biopandas.mmcif) attribute should contain the following 3 DataFrame objects:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['ATOM', 'HETATM', 'ANISOU'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "\n", "pmmcif = PandasMmcif()\n", "pmmcif.read_mmcif('./data/3eiy.cif')\n", "pmmcif.df.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.cif](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.cif)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 'ATOM': contains the entries from the \"ATOM\" coordinate section\n", "- 'HETATM': contains the entries from the \"HETATM\" coordinate section \n", "- 'ANISOU': contains the entries from the \"ANISOU\" coordinate section " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns of the `'ATOM'` DataFrame are as follows:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id', 'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id', 'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy', 'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id', 'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num'], dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['ATOM'].columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **'group_PDB'**:\n", "The group of atoms to which the atom site belongs. 
This data\n", " item is provided for compatibility with the original Protein\n", " Data Bank format, and only for that purpose.\n", "- **'id'**: The value of _atom_site.id must uniquely identify a record in the\n", " ATOM_SITE list. Note that this item need not be a number; it can be any unique\n", " identifier.\n", "- **'type_symbol'**: The code used to identify the atom species (singular or plural)\n", " representing this atom type. Normally this code is the element\n", " symbol. The code may be composed of any character except\n", " an underscore with the additional proviso that digits designate\n", " an oxidation state and must be followed by a + or - character.\n", "- **'label_atom_id'**: An atom name identifier, e.g., N, CA, C, O, ...\n", "- **'label_alt_id'**: A place holder to indicate alternate conformation. The alternate conformation\n", " can be an entire polymer chain, or several residues or\n", " partial residue (several atoms within one residue). If\n", " an atom is provided in more than one position, then a\n", " non-blank alternate location indicator must be used for\n", " each of the atomic positions.\n", "- **'label_comp_id'**: For protein polymer entities, this is the three-letter code for\n", " the amino acid. 
For nucleic acid polymer entities, this is the one-letter code\n", " for the base.\n", "- **'label_asym_id'**: A value that uniquely identifies a record in\n", " the STRUCT_ASYM list.\n", "- **'label_entity_id'**: A value that uniquely identifies a record in\n", " the ENTITY list.\n", "- **'label_seq_id'**: A value that uniquely identifies a record in\n", " the ENTITY_POLY_SEQ list.\n", "- **'pdbx_PDB_ins_code'**: PDB insertion code.\n", "- **'Cartn_x'**: The x atom-site coordinate in angstroms\n", "- **'Cartn_y'**: The y atom-site coordinate in angstroms\n", "- **'Cartn_z'**: The z atom-site coordinate in angstroms\n", "- **'occupancy'**: The fraction of the atom type present at this site.\n", " The sum of the occupancies of all the atom types at this site\n", " may not significantly exceed 1.0 unless it is a dummy site.\n", "- **'B_iso_or_equiv'**: Isotropic atomic displacement parameter, or equivalent isotropic\n", " atomic displacement parameter, B_eq, calculated from the\n", " anisotropic displacement parameters. \n", "- **'pdbx_formal_charge'**: The net integer charge assigned to this atom. 
This is the\n", " formal charge assignment normally found in chemical diagrams.\n", "- **'auth_seq_id'**: An alternative identifier for _atom_site.label_seq_id that\n", " may be provided by an author in order to match the identification\n", " used in the publication that describes the structure.\n", "- **'auth_comp_id'**: An alternative identifier for _atom_site.label_comp_id that\n", " may be provided by an author in order to match the identification\n", " used in the publication that describes the structure.\n", "- **'auth_asym_id'**: An alternative identifier for _atom_site.label_asym_id that\n", " may be provided by an author in order to match the identification\n", " used in the publication that describes the structure.\n", "- **'auth_atom_id'**: An alternative identifier for _atom_site.label_atom_id that\n", " may be provided by an author in order to match the identification\n", " used in the publication that describes the structure.\n", "- **'pdbx_PDB_model_num'**: PDB model number." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns of the 'HETATM' DataFrame are identical to those of the 'ATOM' DataFrame that we've seen earlier:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "1330 HETATM 1331 K K ... K A K 1\n", "1331 HETATM 1332 NA NA ... NA A NA 1\n", "\n", "[2 rows x 21 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['HETATM'].head(2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(pmmcif.df['HETATM'].columns) == set(pmmcif.df['ATOM'].columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the ANISOU columns are named somewhat differently. For instance, the `'ATOM'` and `'HETATM'` DataFrames contain the following columns that are not present in `'ANISOU'`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'B_iso_or_equiv',\n", " 'Cartn_x',\n", " 'Cartn_y',\n", " 'Cartn_z',\n", " 'auth_asym_id',\n", " 'auth_atom_id',\n", " 'auth_comp_id',\n", " 'auth_seq_id',\n", " 'group_PDB',\n", " 'label_alt_id',\n", " 'label_asym_id',\n", " 'label_atom_id',\n", " 'label_comp_id',\n", " 'label_entity_id',\n", " 'label_seq_id',\n", " 'occupancy',\n", " 'pdbx_PDB_model_num',\n", " 'pdbx_formal_charge'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(pmmcif.df['ATOM'].columns) - set(pmmcif.df['ANISOU'].columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversely, `'ANISOU'` contains the following columns that are not present in the `'ATOM'` and `'HETATM'` DataFrames:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'U[1][1]',\n", " 'U[1][2]',\n", " 'U[1][3]',\n", " 'U[2][2]',\n", " 'U[2][3]',\n", " 'U[3][3]',\n", " 'pdbx_auth_asym_id',\n", " 'pdbx_auth_atom_id',\n", " 
'pdbx_auth_comp_id',\n", " 'pdbx_auth_seq_id',\n", " 'pdbx_label_alt_id',\n", " 'pdbx_label_asym_id',\n", " 'pdbx_label_atom_id',\n", " 'pdbx_label_comp_id',\n", " 'pdbx_label_seq_id'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(pmmcif.df['ANISOU'].columns) - set(pmmcif.df['ATOM'].columns) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BioPandas tries to stay as close as possible to the original column names. For more details, we recommend checking the original descriptions:\n", "\n", "- [ATOM/HETATM](https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ATOMP)\n", "- [ANISOU](https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ANISOU)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with mmCIF DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous sections, we've seen how to load mmCIF structures into DataFrames, and how to access them. Now, let's talk about manipulating mmCIF files in DataFrames." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "0 ATOM 1 N N ... SER A N 1\n", "1 ATOM 2 C CA ... SER A CA 1\n", "2 ATOM 3 C C ... SER A C 1\n", "3 ATOM 4 O O ... SER A O 1\n", "4 ATOM 5 C CB ... SER A CB 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.mmcif import PandasMmcif\n", "pmmcif = PandasMmcif()\n", "pmmcif.read_mmcif('./data/3eiy.cif.gz')\n", "pmmcif.df['ATOM'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.cif.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.cif.gz?raw=true)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, there's actually not *that* much to say ... \n", "Once we have our mmCIF file in the DataFrame format, we have the whole convenience of [pandas](http://pandas.pydata.org) right there at our fingertips." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, let's get all Proline residues:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "38 ATOM 39 N N ... PRO A N 1\n", "39 ATOM 40 C CA ... PRO A CA 1\n", "40 ATOM 41 C C ... PRO A C 1\n", "41 ATOM 42 O O ... PRO A O 1\n", "42 ATOM 43 C CB ... PRO A CB 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['ATOM'][pmmcif.df['ATOM']['auth_comp_id'] == 'PRO'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the alpha-carbon (CA) atoms of the main chain:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "1 ATOM 2 C CA ... SER A CA 1\n", "7 ATOM 8 C CA ... PHE A CA 1\n", "18 ATOM 19 C CA ... SER A CA 1\n", "24 ATOM 25 C CA ... ASN A CA 1\n", "32 ATOM 33 C CA ... VAL A CA 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['ATOM'][pmmcif.df['ATOM']['label_atom_id'] == 'CA'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also easy to strip our coordinate section from hydrogen atoms if there are any ..." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "0 ATOM 1 N N ... SER A N 1\n", "1 ATOM 2 C CA ... SER A CA 1\n", "2 ATOM 3 C C ... SER A C 1\n", "3 ATOM 4 O O ... SER A O 1\n", "4 ATOM 5 C CB ... SER A CB 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmmcif.df['ATOM'][pmmcif.df['ATOM']['type_symbol'] != 'H'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, let's compute the average temperature factor (B-factor) of our protein main chain:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "mainchain = pmmcif.df['ATOM'][(pmmcif.df['ATOM']['label_atom_id'] == 'C') | \n", " (pmmcif.df['ATOM']['label_atom_id'] == 'O') | \n", " (pmmcif.df['ATOM']['label_atom_id'] == 'N') | \n", " (pmmcif.df['ATOM']['label_atom_id'] == 'CA')]\n", "\n", "# The B-factor is stored in the 'B_iso_or_equiv' column (not 'occupancy')\n", "bfact_mc_avg = mainchain['B_iso_or_equiv'].mean()\n", "print('Average B-Factor [Main Chain]: %.2f' % bfact_mc_avg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are using pandas under the hood, which in turn uses matplotlib under the hood, we can produce quick summary plots of our mmCIF structures relatively conveniently:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "\n", "pmmcif = PandasMmcif().read_mmcif('./data/3eiy.cif.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.cif.gz](https://github.com/rasbt/biopandas/blob/main/docs/tutorials/data/3eiy.cif.gz?raw=true)]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import 
matplotlib.pyplot as plt\n", "from matplotlib import style\n", "style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pmmcif.df['ATOM']['B_iso_or_equiv'].plot(kind='hist')\n", "plt.title('Distribution of B-Factors')\n", "plt.xlabel('B-factor')\n", "plt.ylabel('count')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The x-axis is the row index of the ATOM DataFrame, i.e., one point per atom\n", "pmmcif.df['ATOM']['B_iso_or_equiv'].plot(kind='line')\n", "plt.title('B-Factors Along the Amino Acid Chain')\n", "plt.xlabel('Atom Index')\n", "plt.ylabel('B-factor in $A^2$')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pmmcif.df['ATOM']['type_symbol'].value_counts().plot(kind='bar')\n", "plt.title('Distribution of Atom Types')\n", "plt.xlabel('elements')\n", "plt.ylabel('count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing the Root Mean Square Deviation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BioPandas also comes with certain convenience functions, for example, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The root-mean-square deviation (RMSD) is simply a measure of the average distance between the atoms of 2 protein or ligand structures. This calculation of the Cartesian error follows the equation:\n", "\n", "$$\n", "RMSD(a, b) = \\sqrt{\\frac{1}{n} \\sum^{n}_{i=1} \\big((a_{ix} - b_{ix})^2 + (a_{iy} - b_{iy})^2 + (a_{iz} - b_{iz})^2 \\big)}\n", "= \\sqrt{\\frac{1}{n} \\sum^{n}_{i=1} || a_i - b_i||_2^2}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, assuming that we have the following 2 conformations of a ligand molecule\n", "\n", "![](./img/ligand_rmsd.png)\n", "\n", "we can compute the RMSD as follows:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 2.6444 Angstrom\n" ] } ], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "l_1 = PandasMmcif().read_mmcif('./data/lig_conf_1.cif')\n", "l_2 = PandasMmcif().read_mmcif('./data/lig_conf_2.cif')\n", "r = PandasMmcif.rmsd(l_1.df['HETATM'], l_2.df['HETATM'],\n", " s=None) # all atoms, including hydrogens\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File links: [lig_conf_1.cif](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/lig_conf_1.cif), [lig_conf_2.cif](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/lig_conf_2.cif)]" ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 1.7249 Angstrom\n" ] } ], "source": [ "r = PandasMmcif.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], \n", " s='carbon') # carbon atoms only\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 1.9959 Angstrom\n" ] } ], "source": [ "r = PandasMmcif.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], \n", " s='heavy') # heavy atoms only\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can compute the RMSD between 2 related protein structures:\n", "\n", "![](./img/1t48_rmsd.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hydrogen-free RMSD:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 0.7377 Angstrom\n" ] } ], "source": [ "p_1 = PandasMmcif().read_mmcif('./data/1t48_995.cif')\n", "p_2 = PandasMmcif().read_mmcif('./data/1t49_995.cif')\n", "r = PandasMmcif.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='heavy')\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the RMSD between the main chains only:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSD: 0.4781 Angstrom\n" ] } ], "source": [ "p_1 = PandasMmcif().read_mmcif('./data/1t48_995.cif')\n", "p_2 = PandasMmcif().read_mmcif('./data/1t49_995.cif')\n", "r = PandasMmcif.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='main chain')\n", "print('RMSD: %.4f Angstrom' % r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering mmCIF Structures by Distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `distance` method to compute the distance between each atom (or a subset of atoms) in our DataFrame and a three-dimensional reference point. For example:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "p_1 = PandasMmcif().read_mmcif('./data/3eiy.cif')\n", "\n", "reference_point = (9.362, 41.410, 10.542)\n", "distances = p_1.distance(xyz=reference_point, records=('ATOM',))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[File link: [3eiy.cif](https://raw.githubusercontent.com/rasbt/biopandas/main/docs/tutorials/data/3eiy.cif)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `distance` method returns a pandas `Series` object:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 19.267419\n", "1 18.306060\n", "2 16.976934\n", "3 16.902897\n", "4 18.124171\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distances.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this `Series` object, for instance, to select the atoms in our DataFrame that fall within a desired distance threshold. For example, let's select all atoms that are within 7 angstroms of our reference point: " ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
group_PDBidtype_symbollabel_atom_id...auth_comp_idauth_asym_idauth_atom_idpdbx_PDB_model_num
786ATOM787CCB...LEUACB1
787ATOM788CCG...LEUACG1
788ATOM789CCD1...LEUACD11
789ATOM790CCD2...LEUACD21
790ATOM791NN...VALAN1
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " group_PDB id type_symbol label_atom_id ... auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num\n", "786 ATOM 787 C CB ... LEU A CB 1\n", "787 ATOM 788 C CG ... LEU A CG 1\n", "788 ATOM 789 C CD1 ... LEU A CD1 1\n", "789 ATOM 790 C CD2 ... LEU A CD2 1\n", "790 ATOM 791 N N ... VAL A N 1\n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_within_7A = p_1.df['ATOM'][distances < 7.0]\n", "all_within_7A.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualized in PyMOL, this subset (yellow surface) would look as follows:\n", " \n", "![](./img/3eiy_7a.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting Amino Acid codes from 3- to 1-letter codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Residues in the `residue_name` field can be converted into 1-letter amino acid codes, which may be useful for further sequence analysis, for example, pair-wise or multiple sequence alignments:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
auth_asym_idauth_comp_id
1378BI
1386BN
1394BY
1406BR
1417BT
\n", "
" ], "text/plain": [ " auth_asym_id auth_comp_id\n", "1378 B I\n", "1386 B N\n", "1394 B Y\n", "1406 B R\n", "1417 B T" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from biopandas.mmcif import PandasMmcif\n", "\n", "\n", "pmmcif = PandasMmcif().fetch_mmcif('5mtn')\n", "sequence = pmmcif.amino3to1()\n", "sequence.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, the `amino3to1` method returns a `DataFrame` containing the `auth_asym_id` (chain ID) and `auth_comp_id` (residue name) of the translated 1-letter amino acids. If you like to work with the sequence as a Python list of string characters, you could do the following:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['V', 'R', 'H', 'Y', 'T']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequence_list = list(sequence.loc[sequence['auth_asym_id'] == 'A', 'auth_comp_id'])\n", "sequence_list[-5:] # last 5 residues of chain A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if you prefer to work with the sequence as a string, you can use the `join` method: " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(sequence.loc[sequence['auth_asym_id'] == 'A', 'auth_comp_id'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To iterate over the sequences of multi-chain proteins, you can use the `unique` method as shown below:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chain ID: A\n", 
"SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT\n", "\n", "Chain ID: B\n", "SVSSVPTKLEVVAATPTSLLISWDAPAVTVVYYLITYGETGSPWPGGQAFEVPGSKSTATISGLKPGVDYTITVYAHRSSYGYSENPISINYRT\n" ] } ], "source": [ "for chain_id in sequence['auth_asym_id'].unique():\n", " print('\\nChain ID: %s' % chain_id)\n", " print(''.join(sequence.loc[sequence['auth_asym_id'] == chain_id, 'auth_comp_id']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }