# 1-Input
mmtf-pyspark operates on 3D structures in the compressed binary MMTF file format.

Info about MMTF:
* [Website](http://mmtf.rcsb.org/index.html)
* [Format paper](https://doi.org/10.1371/journal.pcbi.1005575)
* [Compression paper](https://doi.org/10.1371/journal.pone.0174846)
* [Specification](https://github.com/rcsb/mmtf/blob/master/spec.md)

Protein Data Bank structures are available in two MMTF data representations:
* full
 * All atom representation 
 * 0.001Å coordinate precision, 0.01 B-factor and occupancy precision
* reduced
 * C-alpha atoms only for polypeptides 
 * P-backbone atoms only for polynucleotides 
 * All atom representation for all other residue types 
 * 0.1Å coordinate precision, 0.1 B-factor and occupancy precision.
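
To make the precision figures above concrete, the sketch below rounds one example coordinate to the stated precision of each representation. It is an illustration only: the `quantize` helper and the sample value are hypothetical and are not part of mmtf-pyspark or the MMTF codecs.

```python
def quantize(value, divisor):
    """Round value to the nearest 1/divisor (hypothetical helper)."""
    return round(value * divisor) / divisor

x = 12.3456  # example coordinate in Å (made-up value)

full = quantize(x, 1000)    # 0.001 Å precision, as in the full representation
reduced = quantize(x, 10)   # 0.1 Å precision, as in the reduced representation

print(full, reduced)
```

The reduced representation trades coordinate detail for a smaller file, which is usually acceptable for coarse-grained analyses.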

## Import pyspark and mmtfPyspark

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader

## Configure Spark

In [2]:
spark = SparkSession.builder.appName("1-Input").getOrCreate()
sc = spark.sparkContext

In [3]:
sc.defaultParallelism

4

## Download Structures
For a small list of PDB entries (tens to about a hundred), the download methods are the quickest way to import structures. Here we download a list of 4 structures in the full representation.

In [4]:
pdbids = ['1LQ9','1LXJ','4XPX','1P1J']
structures = mmtfReader.download_full_mmtf_files(pdbids)

Structures are represented as key-value pairs (tuples):
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

We can print the keys and values using the collect() method. Note that the structures are loaded in an arbitrary order, so you cannot rely on the order of structures.

In [5]:
structures.keys().collect()

['1P1J', '1LXJ', '4XPX', '1LQ9']

In [6]:
structures.values().collect()

[,
 ,
 ,
 ]
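
Because the order of collected pairs is arbitrary, it can help to look structures up by key rather than by position. The sketch below uses a plain Python list of tuples as a stand-in for the collected pairs; the string values are placeholders for MmtfStructure objects. On an actual RDD, `collectAsMap()` returns the pairs as a dictionary directly.

```python
pdbids = ['1LQ9', '1LXJ', '4XPX', '1P1J']

# Stand-in for structures.collect(): (key, value) tuples in arbitrary order.
# The string values are placeholders for MmtfStructure objects.
pairs = [('1P1J', 'structure 1P1J'), ('1LXJ', 'structure 1LXJ'),
         ('4XPX', 'structure 4XPX'), ('1LQ9', 'structure 1LQ9')]

by_id = dict(pairs)                    # look up structures by PDB ID
ordered = [by_id[i] for i in pdbids]   # restore the order of the request

print(ordered)
```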

Spark represents these key-value pairs as a Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. To see how the dataset is distributed, we can print the number of partitions.

In [7]:
structures.getNumPartitions()

4

## Reading structures from an MMTF Hadoop Sequence File
Next, we read PDB structures from a local copy of an MMTF Hadoop Sequence file. For the following examples to work, the MMTF_FULL and MMTF_REDUCED environment variables need to be set. See installation instructions for details.

If you have a long list (1000s) of PDB IDs, you can read the listed structures from a local copy of the MMTF Hadoop Sequence file. However, this is very inefficient for a few structures, as in the example below.

In [8]:
path = "../resources/mmtf_reduced_sample/"
structures = mmtfReader.read_sequence_file(path, pdbids)

Let's print the keys again and see how long this takes. You can see that Spark loads the data only when and if it's required.

In [9]:
structures.keys().collect()

['1LQ9', '1LXJ', '4XPX', '1P1J']
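
One way to picture why a sequence file is a poor fit for a handful of IDs: selecting entries from a flat file of key-value records amounts to examining every record and keeping the matches. The sketch below mimics that in plain Python. `select_by_id` and the record list are hypothetical stand-ins, and the assumption that the reader filters during a full scan is ours, not something stated by the library.

```python
def select_by_id(records, wanted):
    """Keep records whose key is in wanted; also report how many were examined."""
    wanted = set(wanted)
    examined = 0
    selected = []
    for key, value in records:
        examined += 1            # every record is touched, matching or not
        if key in wanted:
            selected.append((key, value))
    return selected, examined

# 10,000 made-up records standing in for the entries of a sequence file.
records = [(f"ID{i:04d}", None) for i in range(10000)]

selected, examined = select_by_id(records, ["ID0007", "ID4242"])
print(len(selected), examined)
```

The cost is driven by the size of the file, not by the length of the ID list, so the scan pays off only when many structures are requested.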

Now, let's read a sample of the PDB archive from the MMTF Hadoop Sequence file.

In [10]:
structures = mmtfReader.read_sequence_file(path).cache()

#### There are 9756 structures in the sample file

In [11]:
%%time
structures.count()

CPU times: user 9.6 ms, sys: 3.85 ms, total: 13.4 ms
Wall time: 5.18 s


9756

### About data flow and caching in Spark
Now, let's count the number of structures again. Should this be faster this time, since we have already loaded the sample?

Not necessarily. Data from the Hadoop Sequence file are streamed through parallel threads, and Spark does not keep them in memory by default. If you need the data again, they are reloaded from scratch unless they were cached, which is why the .cache() method is called after reading the MMTF Hadoop Sequence file above.

Remove the .cache() method call, run this notebook again and compare the time it takes to count the number of structures.

In [12]:
%%time
structures.count()

CPU times: user 7.45 ms, sys: 2.9 ms, total: 10.3 ms
Wall time: 900 ms


9756
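
The difference between the two timings can be pictured with a plain-Python analogy. The `LazyDataset` class below is a hypothetical toy, not Spark's implementation: it loads nothing until `count()` is called, and reloads on every call unless caching was requested.

```python
class LazyDataset:
    """Toy stand-in for an RDD: loads on demand and optionally caches."""

    def __init__(self, loader, cache=False):
        self._loader = loader
        self._cache = cache
        self._data = None
        self.loads = 0           # how many times the loader actually ran

    def count(self):
        if self._data is not None:
            return len(self._data)   # served from the cache, no reload
        data = self._loader()
        self.loads += 1
        if self._cache:
            self._data = data
        return len(data)

def load():
    # Placeholder for reading the sample file.
    return list(range(9756))

uncached = LazyDataset(load)
cached = LazyDataset(load, cache=True)
for dataset in (uncached, cached):
    dataset.count()
    dataset.count()

print(uncached.loads, cached.loads)
```

Counting twice runs the loader twice without caching and once with it, which mirrors the gap between the first and second `count()` timings in this notebook.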

## Reading the whole PDB from MMTF-Hadoop Sequence files

In this workshop we use a sample set of the PDB with about 10,000 structures.

To use the entire PDB, the MMTF_FULL and MMTF_REDUCED environment variables must be set. See the mmtf-pyspark [installation instructions](https://github.com/sbl-sdsc/mmtf-pyspark#installation) for details.
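
A quick way to check the environment before reading the whole PDB is sketched below. The `missing_mmtf_variables` helper is hypothetical and not part of mmtf-pyspark. It only tests whether the two variables are set, not whether they point to valid files.

```python
import os

def missing_mmtf_variables(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    required = ("MMTF_FULL", "MMTF_REDUCED")
    return [name for name in required if not env.get(name)]

missing = missing_mmtf_variables()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("MMTF_FULL and MMTF_REDUCED are set")
```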

### Read whole PDB in the full (all atom) representation
We commented out the lines below, since we are using a smaller sample of the PDB for the tutorials.

To use the whole PDB, the MMTF_FULL and MMTF_REDUCED environment variables need to be set to the `full` and `reduced` MMTF Hadoop Sequence file locations. See [installation instructions](https://github.com/sbl-sdsc/mmtf-pyspark#hadoop-sequence-files) for details.

In [13]:
# %%time
# pdb_full = mmtfReader.read_full_sequence_file()
# pdb_full.count()

### Read whole PDB in the reduced representation

In [14]:
# %%time
# pdb_reduced = mmtfReader.read_reduced_sequence_file()
# pdb_reduced.count()

# Very Important: Stop Spark!!!
It is very important to run the notebook all the way to the spark.stop() statement to terminate Spark. Otherwise, you may end up running multiple instances of Spark that interfere with each other.

In [15]:
spark.stop()