## Demo of PDBrenum in your browser via MyBinder.org

-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell, click on the cell and either click the <i class="fa-play fa"></i> button on the toolbar above, or press <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook. Selecting <b>Cell</b> > <b>Run All</b> from the menu above the toolbar is a shortcut that attempts to run all the cells in the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

Step through running the cells below. Then substitute in your PDB entry identifiers of interest.

In [1]:
%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB

Downloading PDB files: 100%|██████████| 4/4 [00:01<00:00,  2.07it/s]
Downloading SIFTS files: 100%|██████████| 4/4 [00:00<00:00, 12.80it/s]
Renumbering PDB files: 100%|██████████| 4/4 [00:03<00:00,  1.22it/s]


That's it. Really.  
Below, this notebook demonstrates that it worked and fills in some information about running the script here, where to find the output, and options for running it elsewhere. But mostly that is it, as you'll see. 

There are some other handy options. If instead you wanted the converted results in the `mmCIF` format, you'd use the following command here:

```python
%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -mmCIF
```

Or simply leave off any reference to format, because the script defaults to `mmCIF` format if no type is indicated when calling it. `-mmCIF_assembly` and `-PDB_assembly` are also valid types.

Note that the `%run` part is a Jupyter 'magic' command for properly running a script in a Jupyter environment. If you were running the first demonstration command in a terminal, you'd use the following:

```bash
python PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB
```

Depending on your system and how you installed Python, you may need to replace `python` with `python3`.

The `-rfla` flag in the call to the script above stands for `--renumber_from_list_of_arguments` and indicates that we are providing the PDB entry identifiers as part of the command. Use of a text file to provide the PDB ids will be demonstrated [below](#Using-a-list-of-PDB-entry-identifiers).
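
If you wanted to launch the same command from your own Python code rather than from a notebook cell or terminal, one possible approach is to assemble the argument list first. The sketch below is only an illustration: `build_command` and `VALID_FORMATS` are hypothetical names, not part of PDBrenum, and the format flags are limited to the ones visible in the usage text in this notebook.

```python
import sys

# Format flags as they appear in the script's usage text in this notebook.
VALID_FORMATS = {"-PDB", "-mmCIF", "-PDB_assembly", "-mmCIF_assembly", "-all"}


def build_command(pdb_ids, fmt="-mmCIF", script="PDBrenum.py"):
    """Return an argument list that renumbers the given PDB ids.

    Hypothetical helper for illustration; it only builds the command.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unknown format flag: {fmt}")
    # -rfla takes the ids directly on the command line.
    return [sys.executable, script, "-rfla", *pdb_ids, fmt]


cmd = build_command(["1d5t", "1bxw", "2vl3", "5e6h"], "-PDB")
print(" ".join(cmd[1:]))
# To actually run it (assumes PDBrenum.py is in the working directory):
#   import subprocess; subprocess.run(cmd, check=True)
```

Using `sys.executable` avoids having to decide between `python` and `python3`, since it points at whichever interpreter is running the code.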

If you ever need the full list of options and flags, just call the script with the `--help` flag, like below, to print out the full usage details:

```python
%run PDBrenum.py --help
```
Running the cell below will do that.

In [2]:
%run PDBrenum.py --help


PDB.py
optional arguments:
-h, --help            show this help message and exit

-rftf text_file_with_PDB.txt, --renumber_from_text_file text_file_with_PDB.txt
This option will download and renumber specified files
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -all

-rfla [6dbp 3v03 2jit ...], --renumber_from_list_of_arguments [6dbp 3v03 2jit ...]
This option will download and renumber specified files
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -all

-dftf text_file_with_PDB.txt, --downloa


----

### Locating results and showing it worked


Let's demonstrate that the `%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB` command we ran first worked.

When the script runs, it creates a directory for the data it obtains from the PDB. Because the demo command indicated we wanted the legacy PDB format, the script created a directory called `PDB` as it ran and saved the PDB files there.

We can see that in a few steps. First, run the following to list the contents of the working directory:

In [3]:
ls

binder/     LICENSE             output_PDB/      PDBrenum.py*  src/
demo.ipynb  log_corrected.txt   PDB/             README.md*
input.txt*  log_translator.txt  PDBrenum.ipynb*  SIFTS/


(Note that the listing shows `log_corrected.txt` present in the directory along with this `demo.ipynb` notebook. That file holds useful information and is discussed below in the section 'There's some good information that PDBrenum exposes as part of its process'.)

We see the `PDB` directory and we can check the contents of that with the following command:

In [4]:
ls PDB

pdb1bxw.ent.gz  pdb1d5t.ent.gz  pdb2vl3.ent.gz  pdb5e6h.ent.gz


Those files are compressed in the gzip format; however, by combining the Unix `zcat` command, which uncompresses, with the Unix `head` command, which grabs and displays the start of a text file, we can view the start of one of them by running the following command:

In [5]:
!zcat PDB/pdb1bxw.ent.gz|head

HEADER    MEMBRANE PROTEIN                        03-OCT-98   1BXW              
TITLE     OUTER MEMBRANE PROTEIN A (OMPA) TRANSMEMBRANE DOMAIN                  
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: PROTEIN (OUTER MEMBRANE PROTEIN A);                        
COMPND   3 CHAIN: A;                                                            
COMPND   4 FRAGMENT: TRANSMEMBRANE DOMAIN;                                      
COMPND   5 ENGINEERED: YES;                                                     
COMPND   6 MUTATION: YES                                                        
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI BL21(DE3);                     

gzip: stdout: Broken pipe


The exclamation point at the beginning tells Jupyter that this is a Unix command to run in the shell. The `ls` we used above is so commonly used that Jupyter has been told to recognize it without needing the exclamation point.

Don't mind the `gzip: stdout: Broken pipe` at the end; `zcat` is meant to write out an entire file, so it emits a 'Broken pipe' notice when `head` stops reading after the first few lines and the rest of the file cannot be written to the destination. (Also, if you ever see the `gzip` line somewhere other than the very end of the output, just run the cell again and it will probably move to the end where it should show.) The point is you can read the PDB file.
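
If you would rather peek at a compressed file from Python, where no 'Broken pipe' notice arises because nothing is piped between processes, the standard-library `gzip` module can read it directly. The following is a minimal sketch; `head_gz` is a hypothetical helper name, and the demo writes a small temporary file so the cell runs on its own.

```python
import gzip
import os
import tempfile
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest is never decompressed.
        return [line.rstrip("\n") for line in islice(handle, n)]


# Self-contained demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ent.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("".join(f"REMARK line {i}\n" for i in range(50)))
    first_lines = head_gz(demo_path, 3)

print(first_lines)
```

In this session, something like `head_gz("PDB/pdb1bxw.ent.gz")` should show the same header lines as the `zcat` and `head` combination above.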

So that is the initial input. What did the script do?

The output from the `PDBrenum.py` script gets saved over in `output_PDB/` because the PDB format was specified when calling the script. Using commands similar to those used for viewing the initial PDB files, the output can be viewed like so:

In [6]:
ls output_PDB

1bxw_renum.pdb.gz  1d5t_renum.pdb.gz  2vl3_renum.pdb.gz  5e6h_renum.pdb.gz


Alright, we can see the renumbered version of the file we looked at earlier is `1bxw_renum.pdb.gz`. 

But how do we see the difference?

We can add the Unix `tail` command to the 'pipe', feeding it the output of our earlier `zcat` & `head` combination, to show part of the middle of the original and renumbered files. 

First let's display a section of the original by running the command below:

In [7]:
!zcat PDB/pdb1bxw.ent.gz|head -n 510|tail


gzip: stdout: Broken pipe
ATOM     11  C   ALA A   1      46.036  12.651  40.029  1.00 51.14           C  
ATOM     12  O   ALA A   1      47.195  12.259  40.003  1.00 53.33           O  
ATOM     13  CB  ALA A   1      44.229  11.697  41.473  1.00 53.91           C  
ATOM     14  N   PRO A   2      45.736  13.936  40.024  1.00 49.63           N  
ATOM     15  CA  PRO A   2      46.822  14.919  40.021  1.00 51.58           C  
ATOM     16  C   PRO A   2      47.754  14.618  41.197  1.00 56.34           C  
ATOM     17  O   PRO A   2      47.328  14.035  42.194  1.00 54.38           O  
ATOM     18  CB  PRO A   2      46.081  16.238  40.142  1.00 45.83           C  
ATOM     19  CG  PRO A   2      44.708  15.943  39.588  1.00 44.84           C  
ATOM     20  CD  PRO A   2      44.381  14.536  40.054  1.00 41.42           C  


Now to display the renumbered version by running the command below:  
(The renumbered version has an extra 14 lines in the header, which is why `510` is used in the command above and `524` in the command below.)

In [8]:
!zcat output_PDB/1bxw_renum.pdb.gz|head -n 524|tail

ATOM     11  C   ALA A  22      46.036  12.651  40.029  1.00 51.14           C  
ATOM     12  O   ALA A  22      47.195  12.259  40.003  1.00 53.33           O  
ATOM     13  CB  ALA A  22      44.229  11.697  41.473  1.00 53.91           C  
ATOM     14  N   PRO A  23      45.736  13.936  40.024  1.00 49.63           N  
ATOM     15  CA  PRO A  23      46.822  14.919  40.021  1.00 51.58           C  
ATOM     16  C   PRO A  23      47.754  14.618  41.197  1.00 56.34           C  
ATOM     17  O   PRO A  23      47.328  14.035  42.194  1.00 54.38           O  
ATOM     18  CB  PRO A  23      46.081  16.238  40.142  1.00 45.83           C  
ATOM     19  CG  PRO A  23      44.708  15.943  39.588  1.00 44.84           C  
ATOM     20  CD  PRO A  23      44.381  14.536  40.054  1.00 41.42           C  

gzip: stdout: Broken pipe


Comparing the results of the two commands shows that the residues numbered `#1` and `#2` in the original PDB file correspond to residues `#22` and `#23` in the UniProt numbering.  

By viewing [the corresponding UniProt entry](https://www.uniprot.org/uniprot/P0A910#sequences) (shown below for convenience), we can convince ourselves of the validity of this renumbering:
![](binder/start_of1bxw_at_uniprot.png)

The sample above shows that the numbering has been corrected in `1bxw_renum.pdb.gz` and the similarly processed PDB entries.
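
The same comparison can be made programmatically. The sketch below reads the residue number from the fixed columns of `ATOM` records and pairs up lines from the original and renumbered files; `residue_keys` is a hypothetical helper name, and the four sample lines are copied from the outputs above. Filtering on `ATOM` records also sidesteps having to count header lines, as was needed with `head -n 510` versus `head -n 524`.

```python
def residue_keys(pdb_lines):
    """Yield (atom_serial, chain_id, residue_name, residue_number) per ATOM record."""
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # Fixed-width fields of the legacy PDB format (0-based slices).
            yield int(line[6:11]), line[21], line[17:20], int(line[22:26])


original = [
    "ATOM     13  CB  ALA A   1      44.229  11.697  41.473  1.00 53.91           C  ",
    "ATOM     14  N   PRO A   2      45.736  13.936  40.024  1.00 49.63           N  ",
]
renumbered = [
    "ATOM     13  CB  ALA A  22      44.229  11.697  41.473  1.00 53.91           C  ",
    "ATOM     14  N   PRO A  23      45.736  13.936  40.024  1.00 49.63           N  ",
]

offsets = set()
for before, after in zip(residue_keys(original), residue_keys(renumbered)):
    # Same atom serial, chain and residue name; only the number should differ.
    assert before[:3] == after[:3]
    offsets.add(after[3] - before[3])

print(offsets)
```

For these sample lines the only offset is 21. A single constant offset should not be assumed for every entry or chain, so treat this as a spot check rather than a full validation.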



#### Locating the output for download

Above we showed how we can see the results listed from within this notebook and even display contents; however, if anything useful is created, you'll want to get those files out of the `output` directories and download them to your local computer. Jupyter has a file navigator, accessible from the dashboard, that allows you to download files from this session to your local machine. Click on the Jupyter icon in the upper left above this notebook, next to 'demo'. That will take you to the Jupyter Dashboard. You should see the directory `output_PDB` listed there. Click on the word `output_PDB` to go into it, then click the checkbox next to a file name to get a 'Download' button up at the top. Click 'Download' to start downloading the file to your local machine. 

#### Dealing with compression

The files used and produced when running `PDBrenum.py` are compressed with gzip. At any point you can uncompress them with the `gunzip` Unix command. For example, to uncompress the above example output, use:

```text
!gunzip output_PDB/1bxw_renum.pdb.gz
```

After that you can view the file directly as text by either navigating to it in the file navigator and clicking on it to open it in the Jupyter Dashboard, or running the command below to view the first few lines of it directly:

```text
!head output_PDB/1bxw_renum.pdb
```

Substitute `cat` in place of `head` to display the entire file in this notebook.
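
Decompression can also be done from Python with the standard library, which may be handy on systems without `gunzip`. This is a minimal sketch; `gunzip_copy` is a hypothetical helper name, and unlike `gunzip` it leaves the `.gz` file in place. The demo uses a temporary file so it runs on its own.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_copy(gz_path):
    """Decompress gz_path alongside itself and return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")
    out_path = gz_path[:-3]  # drop the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return out_path


# Self-contained demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    gz_file = os.path.join(tmp, "example_renum.pdb.gz")
    with gzip.open(gz_file, "wt") as out:
        out.write("HEADER    DEMO\n")
    plain_file = gunzip_copy(gz_file)
    plain_name = os.path.basename(plain_file)
    with open(plain_file) as handle:
        restored = handle.read()

print(plain_name, repr(restored))
```

In this session the call would look like `gunzip_copy("output_PDB/1bxw_renum.pdb.gz")`.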

### There's some good information that PDBrenum exposes as part of its process

It's important to point out that in the process PDBrenum exposes some information that can be useful in other contexts. Luckily, it makes that information available in an easy-to-access form.  
The `log_corrected.txt` file generated during the process contains some useful information, such as the mapping of chain IDs for each PDB file to UniProt accession identifiers. It sits in the directory where the `output_PDB/` and `PDB/` directories get generated, as illustrated above where `ls` was run in the section 'Locating results and showing it worked' after PDBrenum was first run.

In [9]:
cat log_corrected.txt

SP PDB_id chain_PDB   chain_auth  UniProt             SwissProt              uni_len chain_len     renum 5k_or_50k
+  5e6h   A           A           P29375              KDM5A_HUMAN                294       294         0         0
+  1bxw   A           A           P0A910              OMPA_ECOLI                 171       172       171         1
+  2vl3   A           A           P30044              PRDX5_HUMAN                161       162       161         1
+  2vl3   B           B           P30044              PRDX5_HUMAN                161       161       161         0
+  2vl3   C           C           P30044              PRDX5_HUMAN                161       161       161         0
+  1d5t   A           A           P21856              GDIA_BOVIN                 431       433         0         2


Demonstrations of various ways of taking advantage of this information to map chains in PDB files to UniProt ids are found in a companion notebook, [Demo of using PDBrenum to perform mapping of chain IDs in PDB files to UniProt IDs](chainID_mapping_to_UniProt_id_demo.ipynb). That was originally suggested as an option to address this Biostars question: [Mapping PDB ID + chain ID to UniProt ID](https://www.biostars.org/p/9540519/#9540519). PDBrenum provides the necessary information, via the SIFTS database, parsed out as a side product of its efforts, and the information is in an easy-to-mine, fixed-width, text-based data table.
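
As a quick illustration of mining that table, the sketch below turns the rows shown above into a dictionary keyed by PDB id and chain. It is not necessarily how the companion notebook does it; `chain_to_uniprot` is a hypothetical helper name, and splitting on whitespace is an assumption that holds for the rows shown here but would fail if a row had an empty field.

```python
LOG_SAMPLE = """\
SP PDB_id chain_PDB   chain_auth  UniProt             SwissProt              uni_len chain_len     renum 5k_or_50k
+  5e6h   A           A           P29375              KDM5A_HUMAN                294       294         0         0
+  1bxw   A           A           P0A910              OMPA_ECOLI                 171       172       171         1
+  2vl3   A           A           P30044              PRDX5_HUMAN                161       162       161         1
+  2vl3   B           B           P30044              PRDX5_HUMAN                161       161       161         0
+  2vl3   C           C           P30044              PRDX5_HUMAN                161       161       161         0
+  1d5t   A           A           P21856              GDIA_BOVIN                 431       433         0         2
"""


def chain_to_uniprot(log_text):
    """Map (PDB_id, chain_PDB) to the UniProt accession in each row."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    header = lines[0].split()
    mapping = {}
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        mapping[(row["PDB_id"], row["chain_PDB"])] = row["UniProt"]
    return mapping


mapping = chain_to_uniprot(LOG_SAMPLE)
print(mapping[("1bxw", "A")])
```

To use the real file in this session, read it with `open("log_corrected.txt").read()` and pass the text to the same function.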

---

### Using a list of PDB entry identifiers

You may have a lot of PDB entries that you want to process. The script allows for listing them in a separate text file with each id separated by a space and then indicating that file when calling the script. Such a file is included along with the script as `input.txt`. Let's examine the contents of that:

In [9]:
!head input.txt

2aa3 4zah 2aa2 2af2 2aac 2aaa 2asd
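
If your identifiers already live in a Python list, a file in this space-separated layout can be written with a few lines. This is a sketch under assumptions: `write_id_file` and the file name `my_ids.txt` are hypothetical, and ending the line with a newline is a guess about what the script tolerates, so compare against `input.txt` if in doubt.

```python
import os
import tempfile


def write_id_file(pdb_ids, path):
    """Write the PDB ids on one line, separated by single spaces."""
    with open(path, "w") as handle:
        handle.write(" ".join(pdb_ids) + "\n")


# Self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    id_path = os.path.join(tmp, "my_ids.txt")
    write_id_file(["2aa3", "4zah", "2aa2"], id_path)
    with open(id_path) as handle:
        written = handle.read()

print(repr(written))
```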


We can point the script at that file by using the `-rftf` flag this time, like so:

In [10]:
%run PDBrenum.py -rftf input.txt -PDB

Downloading PDB files: 100%|██████████| 7/7 [00:01<00:00,  3.98it/s]
Downloading SIFTS files: 100%|██████████| 7/7 [00:32<00:00,  4.63s/it]
Renumbering PDB files: 100%|██████████| 7/7 [04:23<00:00, 37.65s/it]


When that is finished, we can run the following cell to see that `output_PDB/` contains additional files that correspond to the contents of `input.txt`.

In [11]:
ls output_PDB

1bxw_renum.pdb.gz  2aa3_renum.pdb.gz  2af2_renum.pdb.gz  4zah_renum.pdb.gz
1d5t_renum.pdb.gz  2aaa_renum.pdb.gz  2asd_renum.pdb.gz  5e6h_renum.pdb.gz
2aa2_renum.pdb.gz  2aac_renum.pdb.gz  2vl3_renum.pdb.gz


Using the means we used to analyze the contents of `1bxw_renum.pdb.gz` above, you could convince yourself those have been processed.
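
For a longer list, checking the directory by eye gets tedious. The sketch below reports which ids lack an output file, relying on the `<id>_renum.pdb.gz` naming seen in the listing above; `missing_outputs` is a hypothetical helper name, and the demo builds a temporary directory so it runs on its own.

```python
import tempfile
from pathlib import Path


def missing_outputs(pdb_ids, output_dir="output_PDB"):
    """Return the ids with no <id>_renum.pdb.gz file in output_dir."""
    out = Path(output_dir)
    return [pid for pid in pdb_ids if not (out / f"{pid}_renum.pdb.gz").is_file()]


# Self-contained demo: two of three expected files exist.
with tempfile.TemporaryDirectory() as tmp:
    for pid in ("2aa3", "4zah"):
        (Path(tmp) / f"{pid}_renum.pdb.gz").touch()
    missing = missing_outputs(["2aa3", "4zah", "2aa2"], output_dir=tmp)

print(missing)
```

In this session, the ids could be read from the file with `open("input.txt").read().split()` and passed to `missing_outputs` with the default directory.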

----
