# Build PNG Files

In this notebook, we'll take the `basic` data set, use the `ibmseti` Python package to convert each data file into a spectrogram, then save each spectrogram as a `.png` file.


We'll also split the data set into a training set and a test set and create a handful of zip files for each class. This dovetails into the next tutorial, where we will train a custom Watson Visual Recognition classifier (using the zip files of PNGs) and measure its performance with the test set.

In [1]:
from __future__ import division

import cStringIO
import glob
import json
import requests
import ibmseti
import os
import zipfile
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#Making a local folder to put my data.

#NOTE: YOU MUST do something like this on a Spark Enterprise cluster at the hackathon so that
#you can put your data into a separate local file space. Otherwise, you'll likely collide with 
#your fellow participants. 

mydatafolder = os.environ['PWD'] + '/' + 'my_team_name_data_folder'
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)

In [3]:
#If you are running this in IBM Apache Spark (via Data Science Experience)
base_url = 'https://dal05.objectstorage.service.networklayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#ELSE, if you are outside of IBM:
#base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#NOTE: if you are outside of IBM, pulling down data will be slower. :/

In [4]:
## You don't need to repeat this, of course, if you've already done this in the Step 1 notebook

basic4zip = '{}/simsignals_basic_v2/basic4.zip'.format(base_url)
os.system('curl {} > {}/{}'.format(basic4zip, mydatafolder, 'basic4.zip'))

0

In [5]:
!ls -alrht $mydatafolder

total 3.3G
drwxr-xr-x 6 sd2d-634b36332a0fab-8605aaf2c6e1 users 4.0K Jun  8 18:38 ..
drwx------ 2 sd2d-634b36332a0fab-8605aaf2c6e1 users 4.0K Jun  8 18:38 .
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users 1.1G Jun  8 18:38 basic4.zip


In [6]:
outputpng_folder = mydatafolder + '/png'
if not os.path.exists(outputpng_folder):
    os.makedirs(outputpng_folder)

In [7]:
zz = zipfile.ZipFile(mydatafolder + '/' + 'basic4.zip')

In [8]:
#Use `ibmseti`, or other methods, to draw the spectrograms

def draw_spectrogram(data):
    
    aca = ibmseti.compamp.SimCompamp(data)
    spec = aca.get_spectrogram()

    # Instead of using the SimCompamp.get_spectrogram method,
    # you can perform your own signal processing here before you create the spectrogram.
    #
    # SimCompamp.get_spectrogram is relatively simple. Here's the code to reproduce it
    # (`data` is the raw file content passed to this function):
    #
    # header, raw_data = data.split('\n',1)
    # complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
    # shape = (32, 6144)
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data.reshape(*shape) ), 1) )**2
    #
    # But instead of the line above, can you manipulate `complex_data` with signal processing
    # techniques in the time domain (windowing? de-chirp?), or manipulate the output of the
    # np.fft.fft process in a way that improves the signal-to-noise ratio (Welch periodogram, subtract noise model)?
    # 
    # example: Apply Hanning Window
    # complex_data = complex_data.reshape(*shape)
    # complex_data = complex_data * np.hanning(complex_data.shape[1])
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data ), 1) )**2


    fig, ax = plt.subplots(figsize=(10, 5))   

    # do different color mappings affect Watson's classification accuracy?
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='hot')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='gray')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='Greys')
    
    ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0])
    
    return fig, aca.header()
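
The commented-out Hanning window idea above can be tried outside of Spark first. The following is a minimal sketch, not the `ibmseti` implementation: it assumes the byte layout described in the comments (interleaved signed 8-bit real and imaginary samples, reshaped to 32 x 6144) and uses random bytes as a stand-in for a real data file, so the variable names `spec_plain` and `spec_windowed` are purely illustrative.

```python
import numpy as np

# Stand-in for the bytes that follow the header line in a data file.
# Random values only; this is NOT real or simulated SETI data.
shape = (32, 6144)
rng = np.random.RandomState(0)
raw_data = rng.randint(-128, 128, size=2 * shape[0] * shape[1]).astype(np.int8).tobytes()

# Same conversion as in the comments above: int8 pairs -> complex64 samples.
complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
complex_data = complex_data.reshape(*shape)

# Spectrogram without any windowing (FFT along each of the 32 rows).
spec_plain = np.abs(np.fft.fftshift(np.fft.fft(complex_data), 1)) ** 2

# Spectrogram with a Hanning window applied to each row before the FFT.
window = np.hanning(shape[1])
spec_windowed = np.abs(np.fft.fftshift(np.fft.fft(complex_data * window), 1)) ** 2
```

Both arrays have the same shape as the spectrogram returned by `get_spectrogram`, so either could be passed to `ax.imshow` in place of `spec` to compare the resulting images.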


In [9]:
## We're going to use Spark to distribute the job of creating the PNGs on the executor nodes

rdd = sc.parallelize(zz.namelist(), 120) #30 executors are available on Enterprise clusters

In [10]:
def extract_data(row):
    return (row, zz.open(row).read())

rdd = rdd.map(extract_data)

In [11]:
def convert_to_spectrogram_and_save(row):
    name = row[0]
    fig, header = draw_spectrogram(row[1])
    png_file = name + '.png'
    fig.savefig(outputpng_folder + '/' + png_file)
    plt.close(fig)
    return (name, header, png_file)

In [12]:
rdd = rdd.map(convert_to_spectrogram_and_save)

In [13]:
results = rdd.collect()  #This took about 70s on my Enterprise cluster. It will take longer on a free-tier cluster.

In [14]:
results[0]

('000919a5-bc7f-471e-959c-81adba0b1f36.dat',
 {u'signal_classification': u'squiggle',
  u'uuid': u'000919a5-bc7f-471e-959c-81adba0b1f36'},
 '000919a5-bc7f-471e-959c-81adba0b1f36.dat.png')
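
Each element of `results` is a `(name, header, png_file)` tuple, so a quick tally of the headers is one way to confirm that every class was converted. This is a small sketch using a made-up `sample_results` list in place of the real `results`; the file names and UUIDs in it are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for `results`; the names and uuids are made up.
sample_results = [
    ('a.dat', {'signal_classification': 'squiggle', 'uuid': 'a'}, 'a.dat.png'),
    ('b.dat', {'signal_classification': 'noise', 'uuid': 'b'}, 'b.dat.png'),
    ('c.dat', {'signal_classification': 'squiggle', 'uuid': 'c'}, 'c.dat.png'),
]

# Count how many PNGs were produced for each signal class.
class_counts = Counter(header['signal_classification']
                       for _, header, _ in sample_results)
```

Running the same expression over the real `results` should give one count per signal class.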

# Create Training / Test sets

Using the `basic` list, we'll create training and test sets for each signal class. Then we'll archive the `.png` files into a handful of `.zip` files. The `.zip` files need to be smaller than 100 MB because Watson Visual Recognition limits the size of each batch of data uploaded when training a classifier.

In [15]:
# Grab the basic file list in order to
# organize the data into classes

r = requests.get('{}/simsignals_files/public_list_basic_v2_26may_2017.csv'.format(base_url), timeout=(9.0, 21.0))

uuids_classes_as_list = r.text.split('\n')[1:-1]  #slice off the first line (header) and last line (empty)

def row_to_json(row):
    uuid,sigclass = row.split(',')
    return {'uuid':uuid, 'signal_classification':sigclass}

uuids_classes_as_list = map(row_to_json, uuids_classes_as_list)
print "found {} files".format(len(uuids_classes_as_list))

uuids_group_by_class = {}
for item in uuids_classes_as_list:
    uuids_group_by_class.setdefault(item['signal_classification'], []).append(item)

found 4000 files
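
The `[1:-1]` slice in the cell above assumes the file ends with exactly one trailing newline. As a hedged alternative, the sketch below skips blank lines explicitly and groups the rows in one pass. The function name `group_by_class`, the header text, and the sample rows are all hypothetical; only the two-column `uuid,class` layout comes from the cell above.

```python
from collections import defaultdict

def group_by_class(csv_text):
    """Group rows of a two-column 'uuid,class' CSV by signal class.

    The first non-blank line is treated as the header and skipped.
    """
    lines = [line for line in csv_text.splitlines() if line.strip()]
    grouped = defaultdict(list)
    for line in lines[1:]:
        uuid, sigclass = line.split(',')
        grouped[sigclass].append({'uuid': uuid, 'signal_classification': sigclass})
    return grouped

# Made-up sample text; the real header names and uuids will differ.
sample_text = "UUID,SIGNAL_CLASSIFICATION\nid-1,noise\nid-2,squiggle\nid-3,noise\n\n"
sample_groups = group_by_class(sample_text)
```

With the real list, `group_by_class(r.text)` would play the role of `uuids_group_by_class`.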


In [16]:
training_percentage = 0.70

training_set_group_by_class = {}
test_set_group_by_class = {}
for k, v in uuids_group_by_class.iteritems():
    
    total = len(v)
    training_size = int(total * training_percentage)

    training_set = v[0:training_size]
    test_set = v[training_size:total]
    
    training_set_group_by_class[k] = training_set
    test_set_group_by_class[k] = test_set
    
    print '{}: training set size: {}'.format(k, len(training_set))
    print '{}: test set size: {}'.format(k, len(test_set))

squiggle: training set size: 700
squiggle: test set size: 300
narrowband: training set size: 700
narrowband: test set size: 300
noise: training set size: 700
noise: test set size: 300
narrowbanddrd: training set size: 700
narrowbanddrd: test set size: 300
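
The split above takes the first 70% of each class in list order. If the order of the file list carries any structure, a seeded shuffle before slicing may be preferable. The sketch below shows one way to do that; `split_items` and its `seed` parameter are assumptions introduced for illustration, not part of the original notebook.

```python
import random

def split_items(items, training_fraction=0.70, seed=42):
    """Return (training, test) lists after a reproducible shuffle."""
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed -> same split on every run
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example with 1000 placeholder items, matching the per-class size above.
demo_training, demo_test = split_items(range(1000))
```

Using a fixed seed keeps the training and test sets stable across notebook runs, which matters when the zip files are rebuilt later.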


In [17]:
training_set_group_by_class['noise'][0]

{'signal_classification': u'noise',
 'uuid': u'498becc2-3693-45b3-8533-50e93532706a'}

In [18]:
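# `v` still holds the last class's list from the loop in the previous cell;
# this cell only previews how the PNG file names are built.
fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]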
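
Before zipping, it can be worth checking that every expected PNG actually exists, since a failed Spark task would otherwise only surface as an error in the middle of the zip loop. A minimal sketch follows; the helper names `png_paths_for` and `find_missing` are hypothetical.

```python
import os

def png_paths_for(items, png_folder):
    """Build the expected <uuid>.dat.png path for each item."""
    return [os.path.join(png_folder, item['uuid'] + '.dat.png') for item in items]

def find_missing(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]
```

For example, `find_missing(png_paths_for(training_set_group_by_class['noise'], outputpng_folder))` should return an empty list if all conversions succeeded.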

In [19]:
zipfilefolder = mydatafolder + '/zipfiles'
if not os.path.exists(zipfilefolder):
    os.makedirs(zipfilefolder)

In [20]:
max_zip_file_size_in_mb = 25

In [21]:
#Create the zip files containing the training PNG files.
#A new archive is started once the current one exceeds <max_zip_file_size_in_mb> MB, because WatsonVR has a limit on the
#size of input files that can be sent in a single HTTP call to train a custom classifier. The size check runs after
#each write, so a finished archive can end up slightly larger than the limit.

for k, v in training_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    count = 1
    for fn in fnames:
        
        archive_name = '{}/classification_{}_{}.zip'.format(zipfilefolder, count, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        
        #if the archive exceeds <max_zip_file_size_in_mb> MB, increase count to start a new one
        if os.path.getsize(archive_name) > max_zip_file_size_in_mb * 1024 ** 2:
            count += 1
            

creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_1_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_2_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_3_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_4_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_5_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/classification_6_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zip
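
The loop above reopens the archive for every file and checks its size only after writing. An alternative, sketched below under a few assumptions, keeps each archive open until it is full and rolls over before the limit would be crossed. It approximates the archive size by the summed size of the input files (PNGs are already compressed, and zip headers add a little overhead on top), and it stores each file under its base name instead of its full path. The function name `write_rolling_zips` and the `archive_template` parameter are hypothetical.

```python
import os
import zipfile

def write_rolling_zips(file_names, archive_template, max_bytes):
    """Pack files into numbered archives, starting a new archive before the
    summed input size would exceed max_bytes. Returns the archive paths.

    archive_template must contain one '{}' placeholder for the counter.
    """
    archives = []
    current = None
    current_bytes = 0
    for fn in file_names:
        size = os.path.getsize(fn)
        # Start a new archive if none is open, or if adding this file would
        # push a non-empty archive past the limit.
        if current is None or (current_bytes > 0 and current_bytes + size > max_bytes):
            if current is not None:
                current.close()
            path = archive_template.format(len(archives) + 1)
            current = zipfile.ZipFile(path, mode='w')
            archives.append(path)
            current_bytes = 0
        current.write(fn, arcname=os.path.basename(fn))
        current_bytes += size
    if current is not None:
        current.close()
    return archives
```

For one class `k`, a call might look like `write_rolling_zips(fnames, zipfilefolder + '/classification_{}_' + k + '.zip', max_zip_file_size_in_mb * 1024 ** 2)`. A single file larger than `max_bytes` still gets its own archive rather than being dropped.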

In [22]:
#Create the zip files containing the test PNG files.
#A new archive is started once the current one exceeds <max_zip_file_size_in_mb> MB, because WatsonVR has a limit on the
#size of input files that can be sent in a single HTTP call. The size check runs after each write, so a finished
#archive can end up slightly larger than the limit.

for k, v in test_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    count = 1
    for fn in fnames:
        
        archive_name = '{}/testset_{}_{}.zip'.format(zipfilefolder, count, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        
        #if the archive exceeds <max_zip_file_size_in_mb> MB, increase count to start a new one
        if os.path.getsize(archive_name) > max_zip_file_size_in_mb * 1024 ** 2:
            count += 1
            

creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_1_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_2_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_3_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_4_squiggle.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_1_narrowband.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_2_narrowband.zip
creating new archive /gpfs/fs01/user/sd2d-634b36332a0fab-8605aaf2c6e1/notebook/work/my_team_name_data_folder/zipfiles/testset_3_narrowband.zip
creatin

In [23]:
!ls -alrth $mydatafolder/zipfiles

total 4.0G
drwx------ 4 sd2d-634b36332a0fab-8605aaf2c6e1 users 4.0K Jun  8 18:41 ..
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:41 classification_1_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:41 classification_2_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_3_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_4_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_5_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_6_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_7_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_8_squiggle.zip
-rw------- 1 sd2d-634b36332a0fab-8605aaf2c6e1 users  26M Jun  8 18:42 classification_9_squiggle.zip
-rw--