{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create Dataset\n", "This notebook extracts information about the secondary structure of proteins from the Protein Data Bank. The ultimate goal is to predict a fold classification from a protein sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "**Rule 2: Document the Process, Not Just the Results.** Here we describe the steps used to produce the dataset.\n", "\n", "\n", "**Rule 7: Build a Pipeline.** Besides documenting all steps, we automate the entire process of dataset creation from the original data files in the /data directory. There are no manual steps.\n", "\n", "\n", "**Rule 8: Share and Explain Your Data.** To enable reproducibility, we provide a /data directory with data files and a file that describes the datasets with download locations and dates.\n", "\n", "\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# column names\n", "value_col = \"foldClass\" # fold class to be predicted" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np \n", "import pdbutils " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read a Representative Set of Protein Chains\n", "Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with low sequence identity among its members. For this analysis, we downloaded a non-redundant set from the [PISCES website](http://dunbrack.fccc.edu/PISCES.php) with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 Å. \n", "\n", "Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. 
[doi: 10.1093/nar/gki402](https://doi.org/10.1093/nar/gki402)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of representative PDB chains: 14051 \n", "\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>length</th><th>Exptl.</th><th>resolution</th><th>R-factor</th><th>FreeRvalue</th><th>pdbChainId</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>330</td><td>XRAY</td><td>2.2</td><td>0.16</td><td>0.29</td><td>12AS.A</td></tr>\n",
" <tr><th>1</th><td>366</td><td>XRAY</td><td>2.1</td><td>0.19</td><td>0.26</td><td>16VP.A</td></tr>\n",
" <tr><th>2</th><td>348</td><td>XRAY</td><td>2.6</td><td>0.22</td><td>0.34</td><td>1A0I.A</td></tr>\n",
" <tr><th>3</th><td>413</td><td>XRAY</td><td>1.7</td><td>0.19</td><td>0.22</td><td>1A12.C</td></tr>\n",
" <tr><th>4</th><td>108</td><td>XRAY</td><td>2.0</td><td>0.21</td><td>0.25</td><td>1A1X.A</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
], "text/plain": [ " length Exptl. resolution R-factor FreeRvalue pdbChainId\n", "0 330 XRAY 2.2 0.16 0.29 12AS.A\n", "1 366 XRAY 2.1 0.19 0.26 16VP.A\n", "2 348 XRAY 2.6 0.22 0.34 1A0I.A\n", "3 413 XRAY 1.7 0.19 0.22 1A12.C\n", "4 108 XRAY 2.0 0.21 0.25 1A1X.A" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "representatives = pdbutils.read_pisces_representatives('./data/cullpdb_pc25_res3.0_R1.0_d180920_chains14051.gz')\n", "print(\"Number of representative PDB chains:\", representatives.shape[0], \"\\n\")\n", "representatives.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read Protein Sequence and Secondary Structure Data\n", "Secondary structure of proteins is most commonly assigned with the DSSP method.\n", "\n", "Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. [doi: 10.1002/bip.360221211](https://doi.org/10.1002/bip.360221211)\n", "\n", "DSSP defines [8 classes of secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure):\n", "\n", "| DSSP Class | Code |\n", "| :--- | ---: |\n", "| 4-turn helix (α helix), min. length 4 residues | H |\n", "| Isolated β-bridge (single pair β-sheet hydrogen bond formation) | B |\n", "| Extended strand in parallel or anti-parallel β-sheet conformation, min. length 2 residues | E |\n", "| 3-turn helix (3₁₀ helix), min. length 3 residues | G |\n", "| 5-turn helix (π helix), min. length 5 residues | I |\n", "| Hydrogen bonded turn (3, 4 or 5 turn) | T |\n", "| Bend (the only non-hydrogen-bond based assignment) | S |\n", "| Coil (residues which are not in any of the above conformations) | C |\n", "\n", "The [RCSB Protein Data Bank](https://www.rcsb.org) provides protein sequences and DSSP secondary structure assignments. The function below reads a local copy of the corresponding data file and returns the data as a Pandas dataframe. 
\n", "\n", "The **secondary_structure** string below is the secondary structure assignment for each amino acid residue in the **sequence**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of protein chains in PDB: 402007 \n", "\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pdbChainId</th><th>secondary_structure</th><th>sequence</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>101M.A</td><td>CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...</td><td>MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...</td></tr>\n",
" <tr><th>1</th><td>102L.A</td><td>CCHHHHHHHHHCCEEEEEECTTSCEEEETTEEEESSSCTTTHHHHH...</td><td>MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...</td></tr>\n",
" <tr><th>2</th><td>102M.A</td><td>CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...</td><td>MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...</td></tr>\n",
" <tr><th>3</th><td>103L.A</td><td>CCHHHHHHHHHCCEEEEEECTTSCEEEETTEECCCCCCCCCHHHHH...</td><td>MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...</td></tr>\n",
" <tr><th>4</th><td>103M.A</td><td>CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...</td><td>MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pdbChainId secondary_structure \\\n", "0 101M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... \n", "1 102L.A CCHHHHHHHHHCCEEEEEECTTSCEEEETTEEEESSSCTTTHHHHH... \n", "2 102M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... \n", "3 103L.A CCHHHHHHHHHCCEEEEEECTTSCEEEETTEECCCCCCCCCHHHHH... \n", "4 103M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... \n", "\n", " sequence \n", "0 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... \n", "1 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE... \n", "2 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... \n", "3 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK... \n", "4 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sec_struct = pdbutils.read_secondary_structure('./data/ss_dis.txt.gz')\n", "print(\"Number of protein chains in PDB:\", sec_struct.shape[0], \"\\n\")\n", "sec_struct.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the Intersection between the two Datasets\n", "Note the high level of redundancy in the PDB: the representative set contains about 14,000 protein chains, whereas the entire PDB contains more than 400,000 protein chains.\n", "\n", "By merging the representative set of protein chains with the secondary structure dataset we obtain the intersection between the two sets. Both datasets contain a unique identifier column **pdbChainId** that is used to join the datasets." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of representative chains with secondary structure data: 13791\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pdbChainId</th><th>secondary_structure</th><th>sequence</th><th>length</th><th>Exptl.</th><th>resolution</th><th>R-factor</th><th>FreeRvalue</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>12AS.A</td><td>CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC...</td><td>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...</td><td>330</td><td>XRAY</td><td>2.2</td><td>0.16</td><td>0.29</td></tr>\n",
" <tr><th>1</th><td>16VP.A</td><td>CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...</td><td>SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...</td><td>366</td><td>XRAY</td><td>2.1</td><td>0.19</td><td>0.26</td></tr>\n",
" <tr><th>2</th><td>1A0I.A</td><td>CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT...</td><td>VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT...</td><td>348</td><td>XRAY</td><td>2.6</td><td>0.22</td><td>0.34</td></tr>\n",
" <tr><th>3</th><td>1A12.C</td><td>CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...</td><td>RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...</td><td>413</td><td>XRAY</td><td>1.7</td><td>0.19</td><td>0.22</td></tr>\n",
" <tr><th>4</th><td>1A1X.A</td><td>CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...</td><td>GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...</td><td>108</td><td>XRAY</td><td>2.0</td><td>0.21</td><td>0.25</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pdbChainId secondary_structure \\\n", "0 12AS.A CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... \n", "1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "2 1A0I.A CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... \n", "3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... \n", "4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... \n", "\n", " sequence length Exptl. \\\n", "0 MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... 330 XRAY \n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY \n", "2 VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... 348 XRAY \n", "3 RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY \n", "4 GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY \n", "\n", " resolution R-factor FreeRvalue \n", "0 2.2 0.16 0.29 \n", "1 2.1 0.19 0.26 \n", "2 2.6 0.22 0.34 \n", "3 1.7 0.19 0.22 \n", "4 2.0 0.21 0.25 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = sec_struct.merge(representatives, left_on='pdbChainId', right_on='pdbChainId', how='inner')\n", "print(\"Number of representative chains with secondary structure data:\", df.shape[0])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate Secondary Structure Content\n", "To reduce the DSSP 8-state classification to a 3-state classification we group related secondary structure elements: alpha (I, H, G), beta (E, B), and coil (S, T, C). Then we calculate the fraction of the amino acid residues in each of the three classes." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pdbChainId</th><th>secondary_structure</th><th>sequence</th><th>length</th><th>Exptl.</th><th>resolution</th><th>R-factor</th><th>FreeRvalue</th><th>alpha</th><th>beta</th><th>coil</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>12AS.A</td><td>CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC...</td><td>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...</td><td>330</td><td>XRAY</td><td>2.2</td><td>0.16</td><td>0.29</td><td>0.345455</td><td>0.206061</td><td>0.448485</td></tr>\n",
" <tr><th>1</th><td>16VP.A</td><td>CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...</td><td>SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...</td><td>366</td><td>XRAY</td><td>2.1</td><td>0.19</td><td>0.26</td><td>0.469945</td><td>0.046448</td><td>0.483607</td></tr>\n",
" <tr><th>2</th><td>1A0I.A</td><td>CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT...</td><td>VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT...</td><td>348</td><td>XRAY</td><td>2.6</td><td>0.22</td><td>0.34</td><td>0.232759</td><td>0.318966</td><td>0.448276</td></tr>\n",
" <tr><th>3</th><td>1A12.C</td><td>CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...</td><td>RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...</td><td>413</td><td>XRAY</td><td>1.7</td><td>0.19</td><td>0.22</td><td>0.038741</td><td>0.418886</td><td>0.542373</td></tr>\n",
" <tr><th>4</th><td>1A1X.A</td><td>CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...</td><td>GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...</td><td>108</td><td>XRAY</td><td>2.0</td><td>0.21</td><td>0.25</td><td>0.037037</td><td>0.472222</td><td>0.490741</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pdbChainId secondary_structure \\\n", "0 12AS.A CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... \n", "1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "2 1A0I.A CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... \n", "3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... \n", "4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... \n", "\n", " sequence length Exptl. \\\n", "0 MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... 330 XRAY \n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY \n", "2 VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... 348 XRAY \n", "3 RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY \n", "4 GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY \n", "\n", " resolution R-factor FreeRvalue alpha beta coil \n", "0 2.2 0.16 0.29 0.345455 0.206061 0.448485 \n", "1 2.1 0.19 0.26 0.469945 0.046448 0.483607 \n", "2 2.6 0.22 0.34 0.232759 0.318966 0.448276 \n", "3 1.7 0.19 0.22 0.038741 0.418886 0.542373 \n", "4 2.0 0.21 0.25 0.037037 0.472222 0.490741 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def alpha_fraction(s):\n", " return (s.count('I') + s.count('H') + s.count('G')) / len(s)\n", "\n", "\n", "def beta_fraction(s):\n", " return (s.count('E') + s.count('B')) / len(s)\n", "\n", "\n", "def coil_fraction(s):\n", " return (s.count('S') + s.count('T') + s.count('C')) / len(s)\n", "\n", "\n", "df['alpha'] = df.secondary_structure.apply(alpha_fraction) \n", "df['beta'] = df.secondary_structure.apply(beta_fraction) \n", "df['coil'] = df.secondary_structure.apply(coil_fraction) \n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classify Sequences by Secondary Structure Content\n", "Next we classify each protein chain into one of four classes. 
We use a threshold of 25% to define a predominant class and a threshold of 5% to define a negligible one.\n", "\n", "* alpha: predominantly alpha (alpha > 25% and beta < 5%)\n", "* beta: predominantly beta (beta > 25% and alpha < 5%)\n", "* alpha+beta: significant alpha (> 25%) and beta (> 25%)\n", "* other: cases that do not fit into the 3 classes above \n", "\n", "Protein chains in the **other** class will be ignored in the subsequent analysis." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def protein_fold_class(data, min_threshold, max_threshold):\n", "    \"\"\"\n", "    Return the fold classification (\"alpha\", \"beta\", \"alpha+beta\",\n", "    or \"other\") based on the fractions of alpha and beta residues.\n", "    \"\"\"\n", "    if data.alpha > max_threshold and data.beta < min_threshold:\n", "        return \"alpha\"\n", "    elif data.beta > max_threshold and data.alpha < min_threshold:\n", "        return \"beta\"\n", "    elif data.alpha > max_threshold and data.beta > max_threshold:\n", "        return \"alpha+beta\"\n", "    else:\n", "        return \"other\"\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset size 5370\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pdbChainId</th><th>secondary_structure</th><th>sequence</th><th>length</th><th>Exptl.</th><th>resolution</th><th>R-factor</th><th>FreeRvalue</th><th>alpha</th><th>beta</th><th>coil</th><th>foldClass</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1</th><td>16VP.A</td><td>CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...</td><td>SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...</td><td>366</td><td>XRAY</td><td>2.10</td><td>0.19</td><td>0.26</td><td>0.469945</td><td>0.046448</td><td>0.483607</td><td>alpha</td></tr>\n",
" <tr><th>3</th><td>1A12.C</td><td>CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...</td><td>RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...</td><td>413</td><td>XRAY</td><td>1.70</td><td>0.19</td><td>0.22</td><td>0.038741</td><td>0.418886</td><td>0.542373</td><td>beta</td></tr>\n",
" <tr><th>4</th><td>1A1X.A</td><td>CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...</td><td>GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...</td><td>108</td><td>XRAY</td><td>2.00</td><td>0.21</td><td>0.25</td><td>0.037037</td><td>0.472222</td><td>0.490741</td><td>beta</td></tr>\n",
" <tr><th>5</th><td>1A2X.B</td><td>CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHTCCCCCCCCCCCCCCC</td><td>GDEEKRNRAITARRQHLKSVMLQIAATELEKEEGRREAEKQNYLAEH</td><td>47</td><td>XRAY</td><td>2.30</td><td>0.22</td><td>0.33</td><td>0.574468</td><td>0.000000</td><td>0.425532</td><td>alpha</td></tr>\n",
" <tr><th>8</th><td>1A62.A</td><td>CBHHHHHTSCHHHHHHHHHTTTCCCCTTSCHHHHHHHHHHHHHHTT...</td><td>MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAKSG...</td><td>130</td><td>XRAY</td><td>1.55</td><td>0.22</td><td>0.25</td><td>0.284615</td><td>0.261538</td><td>0.453846</td><td>alpha+beta</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pdbChainId secondary_structure \\\n", "1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... \n", "4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... \n", "5 1A2X.B CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHTCCCCCCCCCCCCCCC \n", "8 1A62.A CBHHHHHTSCHHHHHHHHHTTTCCCCTTSCHHHHHHHHHHHHHHTT... \n", "\n", " sequence length Exptl. \\\n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY \n", "3 RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY \n", "4 GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY \n", "5 GDEEKRNRAITARRQHLKSVMLQIAATELEKEEGRREAEKQNYLAEH 47 XRAY \n", "8 MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAKSG... 130 XRAY \n", "\n", " resolution R-factor FreeRvalue alpha beta coil foldClass \n", "1 2.10 0.19 0.26 0.469945 0.046448 0.483607 alpha \n", "3 1.70 0.19 0.22 0.038741 0.418886 0.542373 beta \n", "4 2.00 0.21 0.25 0.037037 0.472222 0.490741 beta \n", "5 2.30 0.22 0.33 0.574468 0.000000 0.425532 alpha \n", "8 1.55 0.22 0.25 0.284615 0.261538 0.453846 alpha+beta " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# assign protein fold class\n", "df[value_col] = df.apply(protein_fold_class, min_threshold=0.05, max_threshold=0.25, axis=1)\n", "\n", "# exclude protein chains without a dominant classification from further analysis.\n", "df = df[df[value_col] != 'other']\n", "\n", "print(\"Dataset size\", df.shape[0])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save Dataset\n", "We save the representative dataset with protein sequence and fold classification as a Pandas dataframe for further analysis." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df.to_json(\"./intermediate_data/foldClassification.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Step\n", "After you have saved the dataset, run the next step in the workflow [2-CalculateFeatures.ipynb](./2-CalculateFeatures.ipynb) or go back to [0-Workflow.ipynb](./0-Workflow.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }