# 4-Flatmapping
This tutorial demonstrates how to split PDB structures into subcomponents or create biological assemblies. In Spark, a flatMap transformation splits each data record into zero or more records.
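
To make the "zero or more records" behaviour concrete, here is a minimal plain-Python sketch of flatMap semantics. It does not use Spark or mmtfPyspark; `flat_map`, `to_chain_keys`, and the `EMPTY` entry are hypothetical names introduced only for illustration.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to every record and concatenate the returned iterables."""
    return list(chain.from_iterable(func(record) for record in records))


def to_chain_keys(record):
    """Turn one (entry_id, chain_ids) record into zero or more chain keys."""
    entry_id, chain_ids = record
    return [f"{entry_id}.{chain_id}" for chain_id in chain_ids]


# Toy input: one entry with four chains and one hypothetical entry with none.
records = [("4HHB", ["A", "B", "C", "D"]), ("EMPTY", [])]

keys = flat_map(to_chain_keys, records)
print(keys)
```

The entry without chains contributes no output records, which is the difference between a flatMap and a one-to-one map.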

### Import pyspark and mmtfPyspark

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsDnaChain
from mmtfPyspark.mappers import StructureToBioassembly, StructureToPolymerChains, StructureToPolymerSequences
from mmtfPyspark.structureViewer import view_structure
from mmtfPyspark.utils import traverseStructureHierarchy
import py3Dmol

### Configure Spark

In [2]:
spark = SparkSession.builder.appName("4-Flatmapping").getOrCreate()

## Read PDB structures
In this example we download the hemoglobin structure 4HHB, consisting of two alpha subunits and two beta subunits.

In [3]:
quaternary = mmtfReader.download_reduced_mmtf_files(["4HHB"])

In [4]:
view_structure(quaternary.keys().collect());


## Flatmap by protein sequence
Here we extract the polymer sequences using a flatMap transformation. Chains A and C (alpha subunits) share one sequence, and chains B and D (beta subunits) share another.

In [5]:
sequences = quaternary.flatMap(StructureToPolymerSequences())
sequences.take(4)

[('4HHB.A',
 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'),
 ('4HHB.B',
 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH'),
 ('4HHB.C',
 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'),
 ('4HHB.D',
 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH')]
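
One way to confirm which chains share a sequence is to group the chain keys by sequence. The sketch below works on a plain list of `(key, sequence)` pairs, such as the list that `sequences.collect()` would return. For brevity it uses only the first few residues of each sequence shown above, and `group_chains_by_sequence` is a hypothetical helper, not part of mmtfPyspark.

```python
from collections import defaultdict


def group_chains_by_sequence(pairs):
    """Group chain keys that have exactly the same sequence string."""
    groups = defaultdict(list)
    for chain_key, sequence in pairs:
        groups[sequence].append(chain_key)
    # Sort the groups so the result does not depend on dictionary order.
    return sorted(groups.values())


# Truncated stand-ins for the full sequences in the output above.
pairs = [
    ("4HHB.A", "VLSPADK"),
    ("4HHB.B", "VHLTPEEK"),
    ("4HHB.C", "VLSPADK"),
    ("4HHB.D", "VHLTPEEK"),
]

groups = group_chains_by_sequence(pairs)
print(groups)
```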

## Flatmap structures
A flatMap operation splits each data record into zero or more records. Here, we use the StructureToPolymerChains class to flatMap a PDB entry (quaternary structure) to its polymer chains (tertiary structure). Note that the chain ID is appended to the PDB ID. The two alpha subunits are 4HHB.A and 4HHB.C, and the two beta subunits are 4HHB.B and 4HHB.D.

In [6]:
tertiary = quaternary.flatMap(StructureToPolymerChains())
tertiary.keys().collect()

['4HHB.A', '4HHB.B', '4HHB.C', '4HHB.D']

In [7]:
view_structure(tertiary.keys().collect());


For some analyses we may need only one copy of each unique subunit (chains with identical polymer sequences). This can be done by setting excludeDuplicates=True.

In [8]:
tertiary = quaternary.flatMap(StructureToPolymerChains(excludeDuplicates=True))
tertiary.keys().collect()

['4HHB.A', '4HHB.B']
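
The result above is consistent with keeping the first chain seen for each distinct sequence. The following plain-Python sketch illustrates that idea only. It is an assumption about the behaviour, not the library's actual implementation, and `first_chain_per_sequence` is a hypothetical helper operating on truncated stand-in sequences.

```python
def first_chain_per_sequence(pairs):
    """Keep the key of the first chain encountered for each distinct sequence."""
    seen = set()
    unique_keys = []
    for chain_key, sequence in pairs:
        if sequence in seen:
            continue  # a chain with this sequence was already kept
        seen.add(sequence)
        unique_keys.append(chain_key)
    return unique_keys


# Truncated stand-ins for the 4HHB sequences (alpha: A and C, beta: B and D).
pairs = [
    ("4HHB.A", "VLSPADK"),
    ("4HHB.B", "VHLTPEEK"),
    ("4HHB.C", "VLSPADK"),
    ("4HHB.D", "VHLTPEEK"),
]

unique_keys = first_chain_per_sequence(pairs)
print(unique_keys)
```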

### Combine FlatMap with Filter
The filter operations we used previously for whole structures can also be applied to single polymer chains. Here we flatMap PDB structures into polymer chains and then select the DNA chains.

In [9]:
path = "../resources/mmtf_reduced_sample"

dna_chains = mmtfReader \
 .read_sequence_file(path) \
 .flatMap(StructureToPolymerChains(excludeDuplicates=True)) \
 .filter(ContainsDnaChain())
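
The order of the two operations matters. The plain-Python sketch below uses toy records to contrast flatMap-then-filter with filter-then-flatMap. The entry IDs `E1` and `E2`, the `"DNA"`/`"protein"` labels, and the helpers `to_chains`, `is_dna`, and `has_dna` are all hypothetical and stand in for the Spark and mmtfPyspark calls.

```python
# Toy entries: (entry_id, [(chain_id, polymer_kind), ...])
entries = [
    ("E1", [("A", "protein"), ("B", "DNA")]),
    ("E2", [("A", "protein")]),
]


def to_chains(entry):
    """Split one entry into (chain_key, polymer_kind) records."""
    entry_id, chains = entry
    return [(f"{entry_id}.{chain_id}", kind) for chain_id, kind in chains]


def is_dna(chain):
    """Chain-level predicate: is this single chain DNA?"""
    return chain[1] == "DNA"


def has_dna(entry):
    """Entry-level predicate: does any chain of the entry consist of DNA?"""
    return any(kind == "DNA" for _, kind in entry[1])


# flatMap then filter: only the DNA chains survive.
split_then_filter = [c for e in entries for c in to_chains(e) if is_dna(c)]

# filter then flatMap: every chain of a DNA-containing entry survives.
filter_then_split = [c for e in entries if has_dna(e) for c in to_chains(e)]

print(split_then_filter)
print(filter_then_split)
```

Splitting first yields only the DNA chain `E1.B`, whereas filtering first also keeps the protein chain `E1.A` that belongs to the same entry.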

In [10]:
view_structure(dna_chains.keys().collect());


## FlatMap PDB structures to Biological Assemblies

### Read the asymmetric unit
In this example we read the asymmetric unit of 1STP (complex of biotin with streptavidin).

In [11]:
asymmetric_unit = mmtfReader.download_full_mmtf_files(["1STP"])

Print some summary data about this structure.

In [12]:
traverseStructureHierarchy.print_structure_data(asymmetric_unit.first())

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 3
Number of groups : 206
Number of atoms : 1001
Number of bonds : 940



### Create the biological assembly from the asymmetric unit
Now, we use a flatMap operation to map an asymmetric unit to one or more biological assemblies. In the case of 1STP, there is only one biological assembly, which represents a tetramer.

In [13]:
bio_assembly = asymmetric_unit.flatMap(StructureToBioassembly())

In [14]:
bio_assembly.first()[0]

'1STP-BioAssembly1'

As you can see, the biological assembly contains 4 copies of the asymmetric unit.

In [15]:
traverseStructureHierarchy.print_structure_data(bio_assembly.first())

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 12
Number of groups : 824
Number of atoms : 4004
Number of bonds : 3280
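
As a quick arithmetic check, the two summaries printed above can be compared count by count. The numbers below are copied from those outputs; the dictionary names are chosen here for illustration.

```python
# Counts copied from the two print_structure_data outputs above.
asymmetric_unit_counts = {"chains": 3, "groups": 206, "atoms": 1001, "bonds": 940}
bio_assembly_counts = {"chains": 12, "groups": 824, "atoms": 4004, "bonds": 3280}

# Ratio of assembly count to asymmetric-unit count for each quantity.
ratios = {
    name: bio_assembly_counts[name] / asymmetric_unit_counts[name]
    for name in asymmetric_unit_counts
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Chains, groups, and atoms each scale by exactly four, matching the tetramer. The bond count in this output does not scale by exactly four (4 x 940 would be 3760); the outputs shown here do not explain the difference.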



### Shown below is the bioassembly for 1STP (tetramer)

In [16]:
view_structure(["1STP"], bioAssembly=True);


In [17]:
spark.stop()