# Cookbook: Merging samples or Assemblies

ipyrad is designed to let you work with your samples in a modular way: you can split an Assembly object so that different groups of samples are run with different parameter settings, and then merge them back together later. You can also merge multiple samples into a single sample, which is useful when your sequencing lanes include technical replicates or when an individual was sequenced on multiple lanes.

In [1]:
import ipyrad as ip

### Merging assemblies
This combines all samples from the two assemblies into a single assembly. Samples that have the same name in both assemblies are merged into one sample in the merged assembly.

In [2]:
# demux first lane of data
data1 = ip.Assembly("data1")
data1.params.raw_fastq_path = "./ipsimdata/pairddrad_example_R*.gz"
data1.params.barcodes_path = "./ipsimdata/pairddrad_example_barcodes.txt"
data1.params.assembly_method = "reference"
data1.params.reference_sequence = "./ipsimdata/pairddrad_example_genome.fa"
data1.params.datatype = "pairddrad"
data1.run('1', force=True, launch_client=True)

# demux second lane of data
data2 = ip.Assembly("data2")
data2.params.raw_fastq_path = "./ipsimdata/pairddrad_example_R*.gz"
data2.params.barcodes_path = "./ipsimdata/pairddrad_example_barcodes.txt"
data2.params.assembly_method = "reference"
data2.params.reference_sequence = "./ipsimdata/pairddrad_example_genome.fa"
data2.params.datatype = "pairddrad"
data2.run('1', force=True, launch_client=True)

# merge assemblies
mdata = ip.merge("alldata", [data1, data2])
mdata.run("234567", force=True, launch_client=True)

New Assembly: data1
establishing parallel connection:
Assembly: data1
[####################] 100% 0:00:06 | sorting reads        | s1 |
[####################] 100% 0:00:01 | writing/compressing  | s1 |
New Assembly: data2
establishing parallel connection:
Assembly: data2
[####################] 100% 0:00:05 | sorting reads        | s1 |
[####################] 100% 0:00:01 | writing/compressing  | s1 |
New Assembly: alldata
establishing parallel connection:
Assembly: alldata
[####################] 100% 0:00:01 | concatenating inputs | s2 |
[####################] 100% 0:00:08 | processing reads     | s2 |
[####################] 100% 0:00:00 | indexing reference   | s3 |
[####################] 100% 0:00:00 | concatenating        | s3 |
[####################] 100% 0:00:03 | join unmerged pairs  | s3 |
[####################] 100% 0:00:03 | dereplicating        | s3 |
[####################] 100% 0:00:00 | splitting dereps     | s3 |
[####################] 100% 0:00:04 | mapping reads        |
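
The name-collision behaviour described above (samples with the same name in both assemblies end up as one sample) can be pictured with plain dictionaries. This is only a sketch under assumed names: `merge_by_name`, the sample names, and the file names are hypothetical stand-ins and not part of the ipyrad API.

```python
# Hypothetical stand-ins for two demultiplexed lanes: sample name -> fastq files.
lane1 = {"1A_0": ["lane1_1A_0.fastq.gz"], "1B_0": ["lane1_1B_0.fastq.gz"]}
lane2 = {"1A_0": ["lane2_1A_0.fastq.gz"], "1C_0": ["lane2_1C_0.fastq.gz"]}


def merge_by_name(*lanes):
    """Combine lanes; samples sharing a name pool their input files."""
    merged = {}
    for lane in lanes:
        for name, files in lane.items():
            # setdefault creates a fresh list the first time a name is seen
            merged.setdefault(name, []).extend(files)
    return merged


combined = merge_by_name(lane1, lane2)
for name in sorted(combined):
    print(name, combined[name])
```

Here `1A_0` appears in both lanes, so it ends up as a single entry linked to both files, while `1B_0` and `1C_0` are carried over unchanged.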

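### Merging samples
Sometimes you may wish to merge samples within a single Assembly. To do this, pass `ip.merge` a `rename_dict` that maps each original sample name to its new name; samples that map to the same new name are merged.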

In [2]:
# demux first lane of data
data1 = ip.Assembly("data1")
data1.params.raw_fastq_path = "./ipsimdata/rad_example_R1_.fastq.gz"
data1.params.barcodes_path = "./ipsimdata/rad_example_barcodes.txt"
data1.run('1', force=True, launch_client=True)

New Assembly: data1
establishing parallel connection:
Assembly: data1
[####################] 100% 0:00:04 | sorting reads        | s1 |
[####################] 100% 0:00:01 | writing/compressing  | s1 |


In [3]:
# the rename dictionary is applied during merge
mergedict = {
    "1A_0": "A", 
    "1B_0": "A", 
    "1C_0": "A",
    "1D_0": "D",
}
merged = ip.merge("merged", data1, rename_dict=mergedict)
merged.run("234567", force=True, launch_client=True)
merged.stats

New Assembly: merged
establishing parallel connection:
Assembly: merged
[####################] 100% 0:00:00 | concatenating inputs | s2 |
[####################] 100% 0:00:03 | processing reads     | s2 |
[####################] 100% 0:00:00 | concatenating        | s3 |
[####################] 100% 0:00:02 | dereplicating        | s3 |
[####################] 100% 0:00:01 | clustering/mapping   | s3 |
[####################] 100% 0:00:00 | building clusters    | s3 |
[####################] 100% 0:00:00 | chunking clusters    | s3 |
[####################] 100% 0:00:07 | aligning clusters    | s3 |
[####################] 100% 0:00:00 | concat clusters      | s3 |
[####################] 100% 0:00:00 | calc cluster stats   | s3 |
[####################] 100% 0:00:03 | inferring [H, E]     | s4 |
[####################] 100% 0:00:00 | calculating depths   | s5 |
[####################] 100% 0:00:00 | chunking clusters    | s5 |
[####################] 100% 0:00:10 | consens calling      | s5 |
[###

| sample | state | reads_raw | reads_passed_filter | clusters_total | clusters_hidepth | hetero_est | error_est | reads_consens |
|---|---|---|---|---|---|---|---|---|
| 2E_0 | 6 | 20017 | 20017 | 1000 | 1000 | 0.00183 | 0.000766 | 1000 |
| 2F_0 | 6 | 19933 | 19933 | 1000 | 1000 | 0.001996 | 0.000755 | 1000 |
| 2G_0 | 6 | 20030 | 20030 | 1000 | 1000 | 0.00194 | 0.000763 | 1000 |
| 2H_0 | 6 | 20199 | 20199 | 1000 | 1000 | 0.001747 | 0.000756 | 1000 |
| 3I_0 | 6 | 19885 | 19885 | 1000 | 1000 | 0.001807 | 0.000758 | 1000 |
| 3J_0 | 6 | 19822 | 19822 | 1000 | 1000 | 0.001931 | 0.000776 | 1000 |
| 3K_0 | 6 | 19965 | 19965 | 1000 | 1000 | 0.002092 | 0.000766 | 1000 |
| 3L_0 | 6 | 20008 | 20008 | 1000 | 1000 | 0.002042 | 0.000748 | 1000 |
| A | 6 | 60041 | 60041 | 1000 | 1000 | 0.009826 | 0.000833 | 1000 |
| D | 6 | 19966 | 19966 | 1000 | 1000 | 0.001803 | 0.000761 | 1000 |
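
To check which original samples feed each merged name before running the merge, the rename dictionary can be inverted. The following is a minimal sketch using only the standard library; `group_by_new_name` is a hypothetical helper for illustration, not an ipyrad function.

```python
from collections import defaultdict

# Same mapping as the rename dictionary above: old sample name -> new name.
mergedict = {
    "1A_0": "A",
    "1B_0": "A",
    "1C_0": "A",
    "1D_0": "D",
}


def group_by_new_name(rename_dict):
    """Invert the mapping: new name -> sorted list of original names."""
    groups = defaultdict(list)
    for old, new in rename_dict.items():
        groups[new].append(old)
    return {new: sorted(olds) for new, olds in groups.items()}


groups = group_by_new_name(mergedict)
print(groups)
```

A new name that only one sample maps to (here `D`) amounts to a simple rename, while a name shared by several samples (here `A`) pools them. This is consistent with the stats output above, where `A` has roughly three times as many raw reads as the other samples.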


In [104]:
from ipyrad.assemble.write_outputs import *
import ipyparallel as ipp
import numpy as np  # used explicitly in the debugging cells below
ipyclient = ipp.Client()
step = Step7(merged, True, ipyclient)

In [105]:
step.split_clusters()
step.remote_process_chunks()
step.collect_stats()
step.store_file_handles()
step.remote_build_arrays_and_write_loci()
step.remote_write_outfiles()

[####################] 100% 0:00:03 | applying filters     | s7 |
[####################] 100% 0:00:01 | building arrays      | s7 |
[####################] 100% 0:00:00 | writing conversions  | s7 |


In [106]:
step.remote_fill_depths()

[####################] 100% 0:00:00 | indexing vcf depths  | s7 |


In [107]:
step.remote_build_vcf()

[####################] 100% 0:00:01 | writing vcf output   | s7 |


In [91]:
filler.snpsmap[filler.snpsmap[:, 0] == 2]

array([[ 2,  1],
       [ 2, 74],
       [ 2, 84]], dtype=uint32)

In [102]:
#step.remote_fill_depths()

sample = step.data.samples["D"]
filler = VCF_filler(step.data, step.nsnps, sample)

# run function loop
for idx in range(len(filler.locbits)):
    filler.localidx = 0
    
    # locfill function
    edges = np.load(filler.trimbits[idx])
    inds = filler.indbits[idx]
    filler.loclines = iter(open(filler.locbits[idx], 'r'))
    
    while 1:
        try:
            filler.yield_loc()
        except StopIteration:
            break

        filler.locsnps = filler.snpsmap[filler.snpsmap[:, 0] == filler.locidx]
        filler.gtrim = edges[filler.localidx - 1]
        
        # denovo enter catgs func
        if (filler.locsnps.size) and (filler.sname in filler.names):
            nidx = filler.names.index(filler.sname)
            sidx = filler.sidxs[nidx]
            tups = [[int(j) for j in i.split("-")] for i in sidx.split(":")]
            seq = filler.seqs[nidx]
            seqarr = np.array(list(seq))

            print('idx', idx, 'locidx', filler.locidx - 1)

            for snp in filler.locsnps[:, 1]:
                print(snp, seq[snp])
                ishift = seq[:snp].count('-')
                for tup in tups:
                    cidx, coffset = tup
                    pos = snp + (coffset) - ishift            
                    print(pos, (snp, filler.gtrim, coffset, ishift))
                    print(filler.catgs[cidx, pos])
                    if (pos >= 0) & (pos < filler.maxlen):
                        filler.vcfd[filler.snpidx] += filler.catgs[cidx, pos]

                filler.snpidx += 1
            print(seq)
            print("")
        
        else:
            print("NOT HERE")
            filler.snpidx += filler.locsnps.shape[0]

idx 0 locidx 0
10 G
15 (10, 5, 5, 0)
[ 0  0  0 22]
ACTCTCAGGCGTACGATTCGGTACAAGTGTTTTGCTCGTTGGTTGCTTTGACAAACGACGGTATAGCAACGACGAATCATAAGTGT

idx 0 locidx 1
1 G
6 (1, 5, 5, 0)
[ 0  0  0 17]
74 T
79 (74, 5, 5, 0)
[ 0  0 17  0]
84 G
89 (84, 5, 5, 0)
[ 0  0  0 17]
GGATTTGGGTCTGGACGGAGCCAGTTAGTTAAAGTCCGTGCCCCCACGGAATAACGGTAAAATATACAAAACATTAAAGCACTTGG

idx 0 locidx 2
21 T
26 (21, 5, 5, 0)
[ 0  0 18  0]
47 C
52 (47, 5, 5, 0)
[18  0  0  0]
63 T
68 (63, 5, 5, 0)
[ 0  0 18  0]
ATAAGGGTCTCTTCTAGCAAATCGGCGTGCGCTGATCCTGGAATGTCCGGTTGAAGATATGATTACAGTACAGCGTGCGGCGCTAA

idx 0 locidx 3
13 C
18 (13, 5, 5, 0)
[19  0  0  0]
18 G
23 (18, 5, 5, 0)
[ 0  0  0 19]
42 G
47 (42, 5, 5, 0)
[ 0  0  0 19]
47 G
52 (47, 5, 5, 0)
[ 0  0  0 19]
54 C
59 (54, 5, 5, 0)
[19  0  0  0]
60 A
65 (60, 5, 5, 0)
[ 0 19  0  0]
TTACTAAGTGGTTCAAGTGGTTACTGAAGGGATAAGAGACGGGGTCGGAGGCCTCCTGAAATGGGCGGAAAGCTCTCTAACAGGAA

idx 0 locidx 4
42 G
47 (42, 5, 5, 0)
[ 0  0  0 15]
46 A
51 (46, 5, 5, 0)
[ 0 15  0  0]
55 A
60 (55, 5, 5, 0)
[ 0 15  0  0]


idx 0 locidx 41
1 A
6 (1, 5, 5, 0)
[ 0 18  0  0]
2 T
7 (2, 5, 5, 0)
[ 0  0 18  0]
42 T
47 (42, 5, 5, 0)
[ 0  0 18  0]
59 A
64 (59, 5, 5, 0)
[ 0 18  0  0]
60 C
65 (60, 5, 5, 0)
[18  0  0  0]
71 T
76 (71, 5, 5, 0)
[ 0  0 18  0]
81 A
86 (81, 5, 5, 0)
[ 0 18  0  0]
GATAGATCGGAAAGGAGATGGAGTGGGGCTGTAGAAGGCCCCTAATTGCATAGCTAAGAACAGTTGAAGCATTCATCTACTAAGGT

idx 0 locidx 42
13 C
18 (13, 5, 5, 0)
[17  0  0  0]
16 A
21 (16, 5, 5, 0)
[ 0 17  0  0]
23 A
28 (23, 5, 5, 0)
[ 0 17  0  0]
61 T
66 (61, 5, 5, 0)
[ 0  0 17  0]
73 C
78 (73, 5, 5, 0)
[17  0  0  0]
TTGATTAGAAGAGCATACAGCTCAAGTCCAATGTAGGATGGGGCGTATTAGATGCTACTCCTCCATACACGATCTCTATTGAAGGT

idx 0 locidx 43
3 M
8 (3, 5, 5, 0)
[9 7 0 0]
8 G
13 (8, 5, 5, 0)
[ 0  0  0 16]
17 T
22 (17, 5, 5, 0)
[ 0  0 16  0]
19 G
24 (19, 5, 5, 0)
[ 0  0  0 16]
32 T
37 (32, 5, 5, 0)
[ 0  0 16  0]
43 A
48 (43, 5, 5, 0)
[ 0 16  0  0]
54 T
59 (54, 5, 5, 0)
[ 0  0 16  0]
59 T
64 (59, 5, 5, 0)
[ 0  0 16  0]
CATMAGTAGCTCGAGCCTGGGGAGGAACCTATTCGGGGGTCTCACCGCCTAACCTGCTCTGATTCAAGAAG

25 G
30 (25, 5, 5, 0)
[ 0  0  0 19]
31 A
36 (31, 5, 5, 0)
[ 0 19  0  0]
64 G
69 (64, 5, 5, 0)
[ 0  0  0 19]
73 T
78 (73, 5, 5, 0)
[ 0  0 19  0]
85 G
90 (85, 5, 5, 0)
[ 0  0  0 19]
CCCGGCCTGCACCCCGTTTCGTGTCGTCCATAGCATAAATCTCGGCATCGCGCGCCTGAAAGTAGCGGCTTGCTTGAACTTTACCG

idx 2 locidx 105
4 A
9 (4, 5, 5, 0)
[ 0 19  0  0]
15 C
20 (15, 5, 5, 0)
[19  0  0  0]
38 C
43 (38, 5, 5, 0)
[19  0  0  0]
AAAAAGCTACGCACTCGGACACCCCCTTTAACGAAATTCCGAACTCGAACGTCCGGGTCGGAACTCAAGAATCGGCGTTAGTTACT

idx 2 locidx 106
14 G
19 (14, 5, 5, 0)
[ 0  0  0 19]
23 T
28 (23, 5, 5, 0)
[ 0  0 19  0]
34 C
39 (34, 5, 5, 0)
[19  0  0  0]
84 C
89 (84, 5, 5, 0)
[19  0  0  0]
GCTTATAACAGCTGGGGACAACTTATCGCTCTTACAATAGGCGAATCAGTGTTATTACGTGGTAACGAGCTAATCATTCATCTCCC

idx 2 locidx 107
16 T
21 (16, 5, 5, 0)
[ 0  0 18  0]
45 T
50 (45, 5, 5, 0)
[ 0  0 18  0]
82 W
87 (82, 5, 5, 0)
[ 0 12  6  0]
TTGTCCATTTTCAAGTTAGTTAGCGAAAGACCAGTGTACTTACACTTGCCCCGGCCCGACCGACTCTAACTCGATGTAGGAGWTCT

idx 2 locidx 108
22 T
27 (22, 5, 5, 0)
[ 0  0 21  0]
34 C
39

idx 3 locidx 156
36 A
41 (36, 5, 5, 0)
[ 0 23  0  0]
41 G
46 (41, 5, 5, 0)
[ 0  0  0 23]
43 C
48 (43, 5, 5, 0)
[23  0  0  0]
73 A
78 (73, 5, 5, 0)
[ 0 23  0  0]
TGCACTGCGAGGGCTAGGGGACTGCAACAAGCCGAAAGGGTGACGTGCCTCTCCGTCGGGTGCGCTGTGTCGAACCATCCCTTGTT

idx 3 locidx 157
5 T
10 (5, 5, 5, 0)
[ 0  0 20  0]
10 A
15 (10, 5, 5, 0)
[ 0 20  0  0]
46 T
51 (46, 5, 5, 0)
[ 0  0 20  0]
80 T
85 (80, 5, 5, 0)
[ 0  0 20  0]
82 C
87 (82, 5, 5, 0)
[20  0  0  0]
CCCAGTCAGTATGCCTATCGGGCATTCGGGCCCGCGTGCACTCCACTGACGCCGATAGTGCTACAGGCGCCGCTATTGAATCCCGG

idx 3 locidx 158
22 A
27 (22, 5, 5, 0)
[ 0 21  0  0]
33 G
38 (33, 5, 5, 0)
[ 0  0  0 21]
35 G
40 (35, 5, 5, 0)
[ 0  0  0 21]
36 T
41 (36, 5, 5, 0)
[ 0  0 21  0]
79 C
84 (79, 5, 5, 0)
[21  0  0  0]
TAAAGATTCACTTCCACGAGTCATGTTACTAAAGCGTTAAAGCGTAAAGCTTATATCTCTCTTGCATTTGGTATTACATCCTGCGA

idx 3 locidx 159
14 T
19 (14, 5, 5, 0)
[ 0  0 17  0]
51 S
56 (51, 5, 5, 0)
[9 0 0 8]
63 A
68 (63, 5, 5, 0)
[ 0 17  0  0]
75 A
80 (75, 5, 5, 0)
[ 0 17  0  0]
CGACCCGCCTCCTGTGAATCGCATGG

17 C
22 (17, 5, 5, 0)
[19  0  0  0]
36 A
41 (36, 5, 5, 0)
[ 0 19  0  0]
39 T
44 (39, 5, 5, 0)
[ 0  0 19  0]
43 A
48 (43, 5, 5, 0)
[ 0 19  0  0]
79 A
84 (79, 5, 5, 0)
[ 0 19  0  0]
85 T
90 (85, 5, 5, 0)
[ 0  0 19  0]
CTCTCTGTTGGCGACGGCTAATTGATCTCAGCACTCAGTTTAGAAACACACCTTCGTGTGAGGGCTGGCCGTCCCTTACAAAGCGT

idx 4 locidx 210
2 A
7 (2, 5, 5, 0)
[ 0 19  0  0]
11 A
16 (11, 5, 5, 0)
[ 0 19  0  0]
57 A
62 (57, 5, 5, 0)
[ 0 19  0  0]
76 C
81 (76, 5, 5, 0)
[19  0  0  0]
GGATTAAGGCAATCCGAGACCACTCACCGAGACTCTTACATTCTGCGCCCGACGCTGAGGGGCTGATATGGGTGGGCTATGTTCTG

idx 4 locidx 211
1 T
6 (1, 5, 5, 0)
[ 0  0 15  0]
23 A
28 (23, 5, 5, 0)
[ 0 15  0  0]
39 C
44 (39, 5, 5, 0)
[15  0  0  0]
54 G
59 (54, 5, 5, 0)
[ 0  0  0 15]
74 A
79 (74, 5, 5, 0)
[ 0 15  0  0]
76 G
81 (76, 5, 5, 0)
[ 0  0  0 15]
79 C
84 (79, 5, 5, 0)
[15  0  0  0]
CTTGTCCAAAAGTTAGAATCGTGACAGGCCCCCAGTGGTCTGTGATACATTAGTGACACGTAATGAAGATATCAATGCGCATGGTG

NOT HERE
idx 4 locidx 213
1 A
6 (1, 5, 5, 0)
[ 0 21  0  0]
6 T
11 (6, 5, 5, 0)
[ 0  0 21  0]
28 

43 (38, 5, 5, 0)
[ 0  0 25  0]
43 A
48 (43, 5, 5, 0)
[ 0 25  0  0]
44 T
49 (44, 5, 5, 0)
[ 0  0 25  0]
63 A
68 (63, 5, 5, 0)
[ 0 25  0  0]
79 T
84 (79, 5, 5, 0)
[ 0  0 25  0]
GTCCTGATGTGCCCCCACTGAATACGTCTAGTAGCAATTAGGAATGAGTGGGTTTGACATATTAGTAGCACAAAGTCCCTACCTAG

idx 5 locidx 266
9 T
14 (9, 5, 5, 0)
[ 0  0 22  0]
18 A
23 (18, 5, 5, 0)
[ 0 22  0  0]
37 C
42 (37, 5, 5, 0)
[22  0  0  0]
62 G
67 (62, 5, 5, 0)
[ 0  0  0 22]
CAGTTCTATTCTAAGAGGACCGTTCTCTGGGAAGAATCCGGACGCGTGACACACGTCACTCAGCAGTCGTGCTTGTTTATTAGACC

idx 5 locidx 267
16 C
21 (16, 5, 5, 0)
[20  0  0  0]
24 G
29 (24, 5, 5, 0)
[ 0  0  0 20]
32 C
37 (32, 5, 5, 0)
[20  0  0  0]
46 G
51 (46, 5, 5, 0)
[ 0  0  0 20]
49 A
54 (49, 5, 5, 0)
[ 0 20  0  0]
62 A
67 (62, 5, 5, 0)
[ 0 20  0  0]
84 T
89 (84, 5, 5, 0)
[ 0  0 20  0]
CATTAGGGCAATTTCGCGACAGTCGGAAGCTACGCCGGGACAACGAGTCAAAAATACCGCGAACGCTTCAGGATGATATTTAGTTG

idx 5 locidx 268
3 T
8 (3, 5, 5, 0)
[ 0  0 17  0]
52 A
57 (52, 5, 5, 0)
[ 0 17  0  0]
74 A
79 (74, 5, 5, 0)
[ 0 17  0  0]
CCGTAGGGAAT

[ 0  0 17  0]
43 A
48 (43, 5, 5, 0)
[ 0 17  0  0]
69 A
74 (69, 5, 5, 0)
[ 0 17  0  0]
CGGGGCCTCAAAGGGTACGCAGTCTGTATCCATCACTAACCACACCTGAGATCGGAAGAGGTGTTGTGAACTATGGGACCTCAGAG

idx 6 locidx 326
23 C
28 (23, 5, 5, 0)
[23  0  0  0]
40 T
45 (40, 5, 5, 0)
[ 0  0 23  0]
56 T
61 (56, 5, 5, 0)
[ 0  0 23  0]
GACCCCTATATACAACCTGCAGACAATTCAATCGGGGGTATCAACTAATAATGGGATGGGGTGCAGTATAAAGTCCAACGACGGCA

idx 6 locidx 327
29 G
34 (29, 5, 5, 0)
[ 0  0  0 17]
50 A
55 (50, 5, 5, 0)
[ 0 17  0  0]
53 G
58 (53, 5, 5, 0)
[ 0  0  0 17]
63 C
68 (63, 5, 5, 0)
[17  0  0  0]
68 T
73 (68, 5, 5, 0)
[ 0  0 17  0]
75 A
80 (75, 5, 5, 0)
[ 0 17  0  0]
CTTGTGAATCTCTTTAGCCTAGTCGTAAAGAAATAATTCGAAAAAAACCAACAGTCTTGTCGTCGGCTTCTTGATATCTTCCAACG

idx 6 locidx 328
5 G
10 (5, 5, 5, 0)
[ 0  0  0 14]
34 C
39 (34, 5, 5, 0)
[14  0  0  0]
39 C
44 (39, 5, 5, 0)
[14  0  0  0]
72 A
77 (72, 5, 5, 0)
[ 0 14  0  0]
TACAAGCTGGCCACGCTTCAGATAATGCCAATGCCCTTTCTAGCAAGTAGAGCTCTACTGGGCTTTAGATGGAATTTGTTTAGTAT

idx 6 locidx 329
15 T
20 (15, 5, 5, 0)
[ 0  0

[ 0  0 29  0]
14 T
19 (14, 5, 5, 0)
[ 0  0 29  0]
ATTGAACGGGTATCTAGACCGAGTTAGTTTCATCAGTGGGAAATATGCGCTAGGTCGGCGTTGCCATGGATTAGACGCGACGAAGT

idx 7 locidx 379
9 G
14 (9, 5, 5, 0)
[ 0  0  0 22]
38 A
43 (38, 5, 5, 0)
[ 0 22  0  0]
45 T
50 (45, 5, 5, 0)
[ 0  0 22  0]
46 G
51 (46, 5, 5, 0)
[ 0  0  0 22]
67 C
72 (67, 5, 5, 0)
[22  0  0  0]
AAATGGCCGGAGCGCTTATCTCTCGATATCAAATCGGGATTCCGGTGAAAAACTGTCAGGTTACCGTCCGGGCCCGCTTTCAGTCT

idx 7 locidx 380
51 G
56 (51, 5, 5, 0)
[ 0  0  0 18]
61 M
66 (61, 5, 5, 0)
[ 8 10  0  0]
84 A
89 (84, 5, 5, 0)
[ 0 18  0  0]
CCTACAGCTCTAAAGCCATCGGATGACTCTTAGGGTGCGCGATACGGTTACGCAAGTGGACMAATGATCCACGTATCATTCCCTAA

idx 7 locidx 381
15 G
20 (15, 5, 5, 0)
[ 0  0  0 18]
18 A
23 (18, 5, 5, 0)
[ 0 18  0  0]
71 C
76 (71, 5, 5, 0)
[18  0  0  0]
72 A
77 (72, 5, 5, 0)
[ 0 18  0  0]
77 T
82 (77, 5, 5, 0)
[ 0  0 18  0]
TAATTGTTAAGACTTGGAAGCGTAGGGAGACACCTCCCTAGTGGAAGGTGGCGCTACCTCAGGAGCCATATCAGGCATGGGTTGCG

idx 7 locidx 382
7 T
12 (7, 5, 5, 0)
[ 0  0 17  0]
26 G
31 (26, 5, 5, 0)
[ 0  0  

idx 8 locidx 448
17 C
22 (17, 5, 5, 0)
[20  0  0  0]
19 C
24 (19, 5, 5, 0)
[20  0  0  0]
26 C
31 (26, 5, 5, 0)
[20  0  0  0]
CAATGTCCGCGAGGACCCCCCCTAAACGACTCCAAGTTGTGAGCCCCAACCTCCTTCTTGCACGATAGAGTTCGCATGTCTCCTAG

idx 8 locidx 449
8 T
13 (8, 5, 5, 0)
[ 0  0 16  0]
21 A
26 (21, 5, 5, 0)
[ 0 16  0  0]
43 T
48 (43, 5, 5, 0)
[ 0  0 16  0]
61 G
66 (61, 5, 5, 0)
[ 0  0  0 16]
69 G
74 (69, 5, 5, 0)
[ 0  0  0 16]
72 G
77 (72, 5, 5, 0)
[ 0  0  0 16]
76 A
81 (76, 5, 5, 0)
[ 0 16  0  0]
80 A
85 (80, 5, 5, 0)
[ 0 16  0  0]
AACGCCATTTTTGATGGGACAACACTGGCGGGCGTTGATTGGATTCTCGTTTAACGTGAAGGGAAGTGAGGGGACGAGTCAAAGGA

idx 9 locidx 450
34 G
39 (34, 5, 5, 0)
[ 0  0  0 20]
64 A
69 (64, 5, 5, 0)
[ 0 20  0  0]
GAAACTGTGTCCGTCGAGAGGTTGATCGTTCTTCGAAGTTATTTTAAATCAGGCGCCATAGCCTAATCCAAGGCGGTGATGCAGAT

idx 9 locidx 451
56 A
61 (56, 5, 5, 0)
[ 0 24  0  0]
70 C
75 (70, 5, 5, 0)
[23  1  0  0]
GGACCAGCTAACGGAACAATGGTAGGTTGCTAAGATGAGGGTCAGCCCTACAAAAGAATCTTAATTAGTCCGCAGCATCCCTAAAT

idx 9 locidx 452
5 T
10 (5, 5, 5, 0)
[ 0  

53 T
58 (53, 5, 5, 0)
[ 0  0 20  0]
GGATATACGGCGTTAATTAATCAGAATTCATGTGGCGACAGACTTGCTATGGCTGCGCAAGACACAACGTTTATCGGGGACTGCAG

idx 10 locidx 522
9 T
14 (9, 5, 5, 0)
[ 0  0 17  0]
20 T
25 (20, 5, 5, 0)
[ 0  0 17  0]
55 A
60 (55, 5, 5, 0)
[ 0 17  0  0]
74 G
79 (74, 5, 5, 0)
[ 0  0  0 17]
CGCCTAGACTTGCGGCAATCTGGCCGGTGCCCTACACTCAAAATACGTCCCTCCCATTGTTGTCCCGATAGTCCGGGTCTTTCGAG

idx 10 locidx 523
7 G
12 (7, 5, 5, 0)
[ 0  0  0 16]
9 A
14 (9, 5, 5, 0)
[ 0 16  0  0]
19 G
24 (19, 5, 5, 0)
[ 0  0  0 16]
58 G
63 (58, 5, 5, 0)
[ 0  0  0 16]
81 T
86 (81, 5, 5, 0)
[ 0  0 16  0]
TTAGGCCGAACTATGCTACGTTTCTGGGTGGGCTCTAAAAAAGGTCTTACCGATGTACGTTGCTTCCGCACCGCTGCAGACTCGCT

idx 10 locidx 524
67 T
72 (67, 5, 5, 0)
[ 0  0 14  0]
85 A
90 (85, 5, 5, 0)
[ 0 14  0  0]
GGGGAGCGCCCTTTTCTCTGACTAGGTCTGCCAAAATCTGTTAATAATCGATCGATCTACCCATACGTAGAGGGATGGGTGGTTAA

idx 10 locidx 525
30 A
35 (30, 5, 5, 0)
[ 0 19  0  0]
44 T
49 (44, 5, 5, 0)
[ 0  0 19  0]
51 G
56 (51, 5, 5, 0)
[ 0  0  0 19]
TCTATTCTAGTGCATCAAACAGAGTTCGGGACAGTGAGCGAG

30 T
35 (30, 5, 5, 0)
[ 0  0 21  0]
35 T
40 (35, 5, 5, 0)
[ 0  0 21  0]
57 C
62 (57, 5, 5, 0)
[21  0  0  0]
76 T
81 (76, 5, 5, 0)
[ 0  0 21  0]
CTAAGCGAGAAGGGGCAATGATAGTGCCACTTTAATCAAAGGGCTTCATTTAGGGTCCGACACAGCCCTGATTTACTGAGATTAGC

idx 12 locidx 616
24 C
29 (24, 5, 5, 0)
[20  0  0  0]
41 C
46 (41, 5, 5, 0)
[20  0  0  0]
60 T
65 (60, 5, 5, 0)
[ 0  0 20  0]
65 A
70 (65, 5, 5, 0)
[ 0 20  0  0]
70 G
75 (70, 5, 5, 0)
[ 0  0  0 20]
77 C
82 (77, 5, 5, 0)
[20  0  0  0]
CCCTCCCTGAGAGCGATGGCGCCTCTAACCTAGCTGAAGGCCGCTCAGCTCAAGGGGTGTTGTTAAAGCAGAGCTTGCATGGTTCT

idx 12 locidx 617
19 A
24 (19, 5, 5, 0)
[ 0 20  0  0]
53 T
58 (53, 5, 5, 0)
[ 0  0 20  0]
60 G
65 (60, 5, 5, 0)
[ 0  0  0 20]
61 G
66 (61, 5, 5, 0)
[ 0  0  0 20]
74 C
79 (74, 5, 5, 0)
[20  0  0  0]
81 C
86 (81, 5, 5, 0)
[20  0  0  0]
CTCCCAGGTCATTTTGAGGAGAGAGGTACCCAAACTGGCGTAGTTGTAATCTGTCACGCCGGCATTTACGGTGTCCCAGATCTGAC

idx 12 locidx 618
11 G
16 (11, 5, 5, 0)
[ 0  0  0 18]
33 C
38 (33, 5, 5, 0)
[18  0  0  0]
83 A
88 (83, 5, 5, 0)
[ 0 18  0  0

[0 5 7 0]
84 T
89 (84, 5, 5, 0)
[ 0  0 12  0]
CTTCTGAAGCCTTAAATCCTCACGTCAACATGATGCCTWCATGAATCATATACTGTTTATATTATCCTTATACACAAAAGAGCCTA

idx 13 locidx 676
2 C
7 (2, 5, 5, 0)
[22  0  0  0]
19 C
24 (19, 5, 5, 0)
[22  0  0  0]
44 T
49 (44, 5, 5, 0)
[ 0  0 22  0]
54 A
59 (54, 5, 5, 0)
[ 0 22  0  0]
80 T
85 (80, 5, 5, 0)
[ 0  0 22  0]
AACGTCAAGGACCTAATTGCGGGCAATGGACCCGTGAACTCACGTTTCTGCTTGAGCAATAGCAGGTGTAATTTGTAGTCTAAATA

idx 13 locidx 677
72 C
77 (72, 5, 5, 0)
[20  0  0  0]
CACACCGCGCACACGCCGACGCTTCCACGATAAGCCCACTTACCTGTGCACGACCTGTGTATGGGCGCGGATCAAGCGCCGCACTT

idx 13 locidx 678
7 A
12 (7, 5, 5, 0)
[ 0 21  0  0]
84 A
89 (84, 5, 5, 0)
[ 0 21  0  0]
TACAAGAAAGCGAGATCTCGACTTTACGGATTGCGCCGTCTAGAGGATTGGTCATTGTTTGACAGATCTGTGAGACAGATACTGAC

idx 13 locidx 679
20 T
25 (20, 5, 5, 0)
[ 0  0 19  0]
39 T
44 (39, 5, 5, 0)
[ 0  0 19  0]
46 A
51 (46, 5, 5, 0)
[ 0 19  0  0]
GGGGTCGTCACTACAAAACGTAAACATGTAACATCGGGGTAGTGATATCTTAGCCCGCTCATAATTGTAACCGGTTCTATACTGGC

idx 13 locidx 680
28 C
33 (28, 5, 5, 0)
[24  0  0  

idx 14 locidx 711
4 T
9 (4, 5, 5, 0)
[ 0  0 23  0]
17 C
22 (17, 5, 5, 0)
[23  0  0  0]
50 A
55 (50, 5, 5, 0)
[ 0 23  0  0]
GCGTTCAAGAAGATTAACCTTAAGCATGCAACACGCACGCACGAAACCGAATGGTCAGTTGGCCCTAAATGCGCTTAGTTACGACC

idx 14 locidx 712
0 C
5 (0, 5, 5, 0)
[21  0  0  0]
2 A
7 (2, 5, 5, 0)
[ 0 21  0  0]
8 C
13 (8, 5, 5, 0)
[21  0  0  0]
26 C
31 (26, 5, 5, 0)
[21  0  0  0]
34 G
39 (34, 5, 5, 0)
[ 0  0  0 21]
38 C
43 (38, 5, 5, 0)
[21  0  0  0]
53 C
58 (53, 5, 5, 0)
[21  0  0  0]
CAAGATCACGGCGGACAGAACCGCCCCTTTTCTTGTTGCTGGTTAACTTCACGCCGTCATGGTTAGTGGTCAGGCTTTACAGGTCC

idx 14 locidx 713
2 G
7 (2, 5, 5, 0)
[ 0  0  0 19]
10 T
15 (10, 5, 5, 0)
[ 0  0 19  0]
22 S
27 (22, 5, 5, 0)
[10  0  0  9]
24 G
29 (24, 5, 5, 0)
[ 0  0  0 19]
44 A
49 (44, 5, 5, 0)
[ 0 19  0  0]
48 w
53 (48, 5, 5, 0)
[ 0  9 10  0]
TAGATCGTGGTTGGCGCGGCAASTGGTACGACGGATACCCAGGCAGGGwTTTACAGCGTGCTGTCATCGCCAGAGGTTGGGAGCAT

idx 14 locidx 714
33 C
38 (33, 5, 5, 0)
[23  0  0  0]
55 C
60 (55, 5, 5, 0)
[23  0  0  0]
GAGCCACCGGATAAAGGCTGTATACTACCG

idx 14 locidx 749
44 T
49 (44, 5, 5, 0)
[ 0  0 22  0]
76 C
81 (76, 5, 5, 0)
[22  0  0  0]
78 T
83 (78, 5, 5, 0)
[ 0  1 21  0]
GTTTCTGCGAGAAACAGCTCGTATCACGACCCTTTGGCTGCCGGTTCTTAGCATGCAATATGTGGGCATAACTTCTCCTACCTTCT

idx 15 locidx 750
19 G
24 (19, 5, 5, 0)
[ 0  0  0 22]
71 T
76 (71, 5, 5, 0)
[ 0  0 22  0]
GCCATACACTAGTCCAAAAGGGTGTGAACCAGTGCCTATAAACAGCATGTCTGATATTTTGGAAGCTTTCTTAAAGGCAATAACTA

idx 15 locidx 751
1 C
6 (1, 5, 5, 0)
[21  1  0  0]
7 S
12 (7, 5, 5, 0)
[14  0  0  8]
GCACCACSGAGATTGTGGATGGGCTATTGGCCAGTAATCTTATCACCCCAATTCAATGATAAACAAAATTTACCGCGTGCGCAGGA

idx 15 locidx 752
7 T
12 (7, 5, 5, 0)
[ 0  0 18  0]
17 T
22 (17, 5, 5, 0)
[ 0  0 18  0]
29 G
34 (29, 5, 5, 0)
[ 0  0  0 18]
30 T
35 (30, 5, 5, 0)
[ 0  0 18  0]
32 C
37 (32, 5, 5, 0)
[18  0  0  0]
63 G
68 (63, 5, 5, 0)
[ 0  0  0 18]
72 G
77 (72, 5, 5, 0)
[ 0  0  0 18]
AATGTCGTCCGACGAGGTGTCACACGGGTGTACGCGCCCGCCTCTGTTCACAGCGCGTGGTTGGACAGGGAAGCCTTTCCCCTCGC

idx 15 locidx 753
38 C
43 (38, 5, 5, 0)
[12  0  0  0]
63 G
68 (63, 5, 5, 0)
[ 0

idx 16 locidx 802
25 C
30 (25, 5, 5, 0)
[20  0  0  0]
47 C
52 (47, 5, 5, 0)
[20  0  0  0]
54 R
59 (54, 5, 5, 0)
[ 0  8  0 12]
CCCGAATTCTACAGCCGGAAGGACCCTGGGGAGTGCCAGTACATCCCCCAGCTGRTACAATACATTCAATATCATCGGTTATTAAA

NOT HERE
idx 16 locidx 804
51 A
56 (51, 5, 5, 0)
[ 0 18  0  0]
65 T
70 (65, 5, 5, 0)
[ 0  0 18  0]
81 A
86 (81, 5, 5, 0)
[ 0 18  0  0]
GCCGTTGTAGTGGAAATTTAGGGTCTTAGTGTCCAGAAAGCCGTACGGATGACTCCCGGACGCTATCCGGAGTCAGATTCCAACGT

idx 16 locidx 805
9 T
14 (9, 5, 5, 0)
[ 0  0 19  0]
17 G
22 (17, 5, 5, 0)
[ 0  0  0 19]
22 G
27 (22, 5, 5, 0)
[ 0  0  0 19]
42 C
47 (42, 5, 5, 0)
[19  0  0  0]
GACGGCAAATATGACTCGAGAGGCCCATGTTAAGTAAGTACTCGGTCAGATACGCGACAATCCTGCTGGTTCGTGACGCCCGTCCT

idx 16 locidx 806
19 A
24 (19, 5, 5, 0)
[ 0 20  0  0]
37 T
42 (37, 5, 5, 0)
[ 0  0 20  0]
38 C
43 (38, 5, 5, 0)
[20  0  0  0]
79 T
84 (79, 5, 5, 0)
[ 0  0 20  0]
CTTAGAATCGTTACAGATCAGTTGGCGCGATAGCACGTCACACGTAGTAGGAACCTCGTCAAAATGCACAAATGGGCGGTCGTCTG

idx 16 locidx 807
2 T
7 (2, 5, 5, 0)
[ 0  0 19  0]
23 C
28 (23, 5

13 (8, 5, 5, 0)
[ 0  0 17  0]
11 A
16 (11, 5, 5, 0)
[ 0 17  0  0]
34 T
39 (34, 5, 5, 0)
[ 0  0 17  0]
55 A
60 (55, 5, 5, 0)
[ 0 17  0  0]
65 A
70 (65, 5, 5, 0)
[ 0 17  0  0]
77 C
82 (77, 5, 5, 0)
[17  0  0  0]
TAAAGCTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATCCGTTT

idx 17 locidx 851
44 C
49 (44, 5, 5, 0)
[18  0  0  0]
45 T
50 (45, 5, 5, 0)
[ 0  0 18  0]
50 T
55 (50, 5, 5, 0)
[ 0  0 18  0]
CCAGAGACTACCACTCTGATTCAGTTTATGATGTACTCGACCCTCTCTCTTTTTTTAGATGTGGTAAGATTTTATAAAAGTTAACT

idx 17 locidx 852
18 G
23 (18, 5, 5, 0)
[ 0  0  0 25]
20 G
25 (20, 5, 5, 0)
[ 0  0  0 25]
65 T
70 (65, 5, 5, 0)
[ 0  0 25  0]
GGGAGCCGGGGGCACGACGCGGATCTCCGTACAGTAGATCACGATCTCTACGAAATATGAAGAACTTGAGACCTAACCAACCCATT

idx 17 locidx 853
11 C
16 (11, 5, 5, 0)
[22  0  0  0]
58 A
63 (58, 5, 5, 0)
[ 0 22  0  0]
74 G
79 (74, 5, 5, 0)
[ 1  0  0 21]
GTGCTCTAGCACGAGGACATCCGAAGTTTCGATTAGCGTTGATGTCTAAACCTGGAACAACGCAAGTTGGTGTGGCTTCGAACCAG

idx 17 locidx 854
21 C
26 (21, 5, 5, 0)
[17  0  0  0]
ATAGGA


idx 18 locidx 908
33 T
38 (33, 5, 5, 0)
[ 0  0 17  0]
34 C
39 (34, 5, 5, 0)
[17  0  0  0]
48 A
53 (48, 5, 5, 0)
[ 0 17  0  0]
58 G
63 (58, 5, 5, 0)
[ 0  0  0 17]
64 T
69 (64, 5, 5, 0)
[ 0  0 17  0]
78 A
83 (78, 5, 5, 0)
[ 0 17  0  0]
83 G
88 (83, 5, 5, 0)
[ 0  0  0 17]
84 A
89 (84, 5, 5, 0)
[ 0 17  0  0]
ATCTTTGTCGTATCAGATCCTTCATTTTTCTACTCAACCGGTCGGGAAATCGAGAGACGTTGCATAGTAGTAGCGTACAAAGAGAC

NOT HERE
idx 18 locidx 910
3 G
8 (3, 5, 5, 0)
[ 0  0  0 16]
30 T
35 (30, 5, 5, 0)
[ 0  0 16  0]
47 C
52 (47, 5, 5, 0)
[16  0  0  0]
52 G
57 (52, 5, 5, 0)
[ 0  0  0 16]
TCTGCAGCCTGAACTATTGTTTTCTGTTTGTAGACGCACCGTCCTACCGAATGAAGGCTGTACAATCCCGTTGAGCGACTGGATAA

idx 18 locidx 911
20 C
25 (20, 5, 5, 0)
[19  0  0  0]
25 G
30 (25, 5, 5, 0)
[ 0  0  0 19]
59 T
64 (59, 5, 5, 0)
[ 0  0 19  0]
67 G
72 (67, 5, 5, 0)
[ 0  0  0 19]
GTCACATAAGGCCGACTTGTCATGAGTTAAGGCTCAGGACAATAGGACGAAAGTCAGTCTACATGCGGGATAGAAACTGTCGCCTA

idx 18 locidx 912
24 C
29 (24, 5, 5, 0)
[20  0  0  0]
51 S
56 (51, 5, 5, 0)
[ 7  0  0 13]
66 A
71 (

idx 19 locidx 966
0 C
5 (0, 5, 5, 0)
[23  0  0  0]
7 C
12 (7, 5, 5, 0)
[23  0  0  0]
14 C
19 (14, 5, 5, 0)
[23  0  0  0]
17 A
22 (17, 5, 5, 0)
[ 0 23  0  0]
20 G
25 (20, 5, 5, 0)
[ 0  0  0 23]
29 T
34 (29, 5, 5, 0)
[ 0  0 23  0]
71 C
76 (71, 5, 5, 0)
[22  0  1  0]
CGTTCAGCTGGAAACAAAACGATTGCCGGTCACCAAGCGATAAACAGTCGGAGGTGGTGGTTGGGCCCATACGCAGACAACCCCTG

idx 19 locidx 967
29 T
34 (29, 5, 5, 0)
[ 0  0 18  0]
54 C
59 (54, 5, 5, 0)
[18  0  0  0]
56 C
61 (56, 5, 5, 0)
[18  0  0  0]
80 C
85 (80, 5, 5, 0)
[18  0  0  0]
ATATAGAAACAAAGCCCCTCACTCCTAGATCATGGGGCGGCCTATTGATGTTAGCGCAACTAGTGTTCAGTAGAGTGCGACCTCGC

idx 19 locidx 968
13 G
18 (13, 5, 5, 0)
[ 0  0  0 20]
44 G
49 (44, 5, 5, 0)
[ 0  0  0 20]
AACGAAATCTGCTGTCGTCGAGCATTTATACGCTGGATGTATATGAAAGGTACAAAAGTACTGCGGACCGAACCTATGACGATATG

idx 19 locidx 969
14 T
19 (14, 5, 5, 0)
[ 0  0 14  0]
49 C
54 (49, 5, 5, 0)
[14  0  0  0]
50 T
55 (50, 5, 5, 0)
[ 0  0 14  0]
CGAAACCGGCGCACTTCTCCTGGAATTCCAGCGCCACTTCGCCTAGCATCTGCGGTGTCTCAGACATGAAGAACGCGTAGAACTTG

idx 1

In [77]:
seq

'TGGGGTATGTGGTCAATCCATAGACATTATGCGTTCTTCGCACCAGTATTCCACACATTTTATCTAACGAGATGTGTGCCCAAACG'

In [42]:

import copy


def merge(name, assemblies, rename_dict=None):
    """
    Creates and returns a new Assembly object in which samples from two or more
    Assembly objects with matching names are 'merged'. Merging does not affect 
    the actual files written on disk, but rather creates new Samples that are 
    linked to multiple data files, and with stats summed.

    # merge two assemblies
    new = ip.merge('newname', (assembly1, assembly2))

    # merge two assemblies and rename samples
    rename = {"1A_0": "A", "1B_0": "A"}
    new = ip.merge('newname', (assembly1, assembly2), rename_dict=rename)
    """
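    # treat a missing rename dict as empty so the membership test below works
    if rename_dict is None:
        rename_dict = {}

    # create new Assembly
    merged = ip.Assembly(name)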

    # one or multiple assemblies?
    try:
        _ = len(assemblies)
    except TypeError:
        assemblies = [assemblies]

    # iterate over all sample names from all Assemblies
    for data in assemblies:

        # make a deepcopy
        ndata = copy.deepcopy(data)
        for sname, sample in ndata.samples.items():
            
            # rename sample if in rename dict
            if sname in rename_dict:
                sname = rename_dict[sname]
                sample.name = sname
                
            # is it in the merged assembly already
            if sname in merged.samples:
                msample = merged.samples[sname]
                
                # update stats
                msample.stats.reads_raw += sample.stats.reads_raw
                if sample.stats.reads_passed_filter:
                    msample.stats.reads_passed_filter += sample.stats.reads_passed_filter
                
                # append files
                # extend (rather than append) so the file lists stay flat
                msample.files.fastqs.extend(sample.files.fastqs)
                msample.files.edits.extend(sample.files.edits)

                # do not allow state >2 at merging (requires reclustering)
                # if merging WITHIN samples.
                msample.stats.state = min(sample.stats.state, 2)

            # merge its stats and files
            else:
                merged.samples[sname] = sample

    # Merged assembly inherits max of hackers values (max frag length)
    merged.hackersonly.max_fragment_length = max(
        [i.hackersonly.max_fragment_length for i in assemblies])

    # Set the values for some params that don't make sense inside mergers
    merged_names = ", ".join([i.name for i in assemblies])
    merged.params.raw_fastq_path = "Merged: " + merged_names
    merged.params.barcodes_path = "Merged: " + merged_names
    merged.params.sorted_fastq_path = "Merged: " + merged_names

    # return the new Assembly object
    merged.save()
    return merged
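
The two bookkeeping rules in the function above (read counts are summed for pooled samples, and a pooled sample's state is capped at 2 so that it must be re-clustered) can be tried out without ipyrad. The sketch below uses `types.SimpleNamespace` objects as hypothetical stand-ins for Sample objects; `toy_sample`, `toy_merge`, and the read counts are invented for illustration only.

```python
from types import SimpleNamespace


def toy_sample(name, reads_raw, state):
    """Hypothetical stand-in for a Sample: only the fields used below."""
    return SimpleNamespace(name=name, reads_raw=reads_raw, state=state)


def toy_merge(samples, rename_dict=None):
    """Pool samples by (renamed) name, summing reads and capping state at 2."""
    rename_dict = rename_dict or {}
    merged = {}
    for sample in samples:
        name = rename_dict.get(sample.name, sample.name)
        if name in merged:
            target = merged[name]
            target.reads_raw += sample.reads_raw
            # pooled samples must be re-clustered, so state cannot exceed 2
            target.state = min(sample.state, 2)
        else:
            # first time this name is seen: carry the sample over as-is
            merged[name] = toy_sample(name, sample.reads_raw, sample.state)
    return merged


# Illustrative values only; these are not real ipyrad results.
samples = [
    toy_sample("1A_0", 100, 6),
    toy_sample("1B_0", 150, 6),
    toy_sample("1D_0", 120, 6),
]
result = toy_merge(samples, {"1A_0": "A", "1B_0": "A", "1D_0": "D"})
print(result["A"].reads_raw, result["A"].state)
print(result["D"].reads_raw, result["D"].state)
```

In this toy version, as in the function above, a sample that is only renamed keeps its original state, while a sample built from several inputs is reset to at most state 2.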