Commit c73e211

make cluster a plugin
1 parent 8d383a1 commit c73e211

16 files changed, 571 additions & 313 deletions

README.md

Lines changed: 40 additions & 103 deletions
@@ -40,6 +40,7 @@ Table of Contents
 * [Event Detection](#event-detection)
 * [Double-Click Annotations](#double-click-annotations)
 * [Network Architecture](#network-architecture)
+* [Clustering Algorithm](#clustering-algorithm)
 * [Troubleshooting](#troubleshooting)
 * [Frequently Asked Questions](#frequently-asked-questions)
 * [Reporting Problems](#reporting-problems)
@@ -904,21 +905,21 @@ in the `Ground Truth` directory: "activations.log", "activations-samples.log",
 and "activations.npz". The two ending in ".log" report any errors, and the
 ".npz" file contains the actual data in binary format.
 
-Now reduce the dimensionality of the hidden state activations
-to either two or three dimensions with the `Cluster` button.
-Choose to do so using either UMAP ([McInnes, Healy, and Melville
-(2018)](https://arxiv.org/abs/1802.03426)), t-SNE ([van der Maaten and
-Hinton (2008)](http://www.jmlr.org/papers/v9/vandermaaten08a.html)), or PCA.
-UMAP and t-SNE are each controlled by separate parameters (`neighbors` and
-`distance`, and `perplexity` and `exaggeration`, respectively), a description
-of which can be found in the aforementioned articles. UMAP and t-SNE can
-also be optionally preceded by PCA, in which case you'll need to specify
-the fraction of coefficients to retain using `PCA fraction`. You'll also
-need to choose to cluster just the last hidden layer using the "layers"
-multi-select box. The output is two or three files in the `Ground Truth`
-directory: "cluster.log" contains any errors, "cluster.npz" contains binary
-data, and "cluster-pca.pdf" shows the results of the principal components
-analysis (PCA) if one was performed.
+Now reduce the dimensionality of the hidden state activations to either two or
+three dimensions with the `Cluster` button. By default, SongExplorer uses the
+UMAP algorithm ([McInnes, Healy, and Melville
+(2018)](https://arxiv.org/abs/1802.03426)), but t-SNE and PCA can be used
+instead via a plugin (see [Clustering Algorithm](#clustering-algorithm)). For
+now, leave the `neighbors` and `distance` parameters set to their default
+values. A description of how they change the resulting clusters can be found
+in the aforementioned article. Also leave the `PCA fraction` parameter at its
+default. In the future, if you find clustering slow for larger data sets, UMAP
+can be preceded by PCA, and the fraction of coefficients that are retained is
+specified using `PCA fraction`. Lastly, choose to cluster just the last hidden
+layer using the "layers" multi-select box. The output is two or three files in
+the `Ground Truth` directory: "cluster.log" contains any errors, "cluster.npz"
+contains binary data, and "cluster-pca.pdf" shows the results of the principal
+components analysis (PCA) if one was performed.
 
 Finally, click on the `Visualize` button to render the clusters in the
 left-most panel. Adjust the size and transparency of the markers using
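To build intuition for what the optional `PCA fraction` preprocessing does, here is a numpy-only sketch (not SongExplorer's actual code; `pca_reduce` and its variables are hypothetical names) of standardizing flattened activations and keeping a fraction of the principal components before a subsequent UMAP step:

```python
import numpy as np

def pca_reduce(activations, fraction=0.5):
    """Standardize activations, then project onto the top principal
    components, retaining `fraction` of the coefficients."""
    mu = activations.mean(axis=0)
    sigma = activations.std(axis=0)
    scaled = (activations - mu) / sigma
    # SVD of the standardized data yields the principal axes in vt
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    ncomponents = max(1, int(fraction * vt.shape[0]))
    return scaled @ vt[:ncomponents].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))       # e.g. 100 sounds, 8 flattened features
reduced = pca_reduce(X, fraction=0.5)
print(reduced.shape)                # (100, 4)
```

The reduced array would then be handed to UMAP (or t-SNE) in place of the full activations, which is why this speeds up clustering for large data sets.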
@@ -1756,46 +1757,7 @@ supply your own code instead. Simply put in a python file a list called
 script which uses those parameters to generate a "detected.csv" given a WAV
 file. Then change the "detect_plugin" variable in your "configuration.py"
 file to point to the full path of this python file, without the ".py"
-extension. See the minimal example in "src/detect-plugin.py" for a template,
-a pared down version of which is as follows:
-
-    #!/usr/bin/python3
-
-    # a list of lists specifying the detect-specific hyperparameters in the GUI
-    detect_parameters = [
-        ["my-simple-textbox", "h-parameter 1", "", "32", [], None, True],
-        ]
-
-    # a function which returns a vector of strings used to annotate the detected events
-    def detect_labels(audio_nchannels):
-        # kinds = [...]
-        return kinds
-
-    # a script which inputs a WAV file and outputs a CSV file
-    if __name__ == '__main__':
-
-        import os
-        import scipy.io.wavfile as spiowav
-        import sys
-        import csv
-        import json
-
-        _, filename, detect_parameters, audio_tic_rate, audio_nchannels = sys.argv
-
-        detect_parameters = json.loads(detect_parameters)
-        hyperparameter1 = int(detect_parameters["my-simple-textbox"])
-
-        _, song = spiowav.read(filename)
-
-        # add logic here to find events of interest
-        # events = [...]  # e.g. a list of 3-tuples with start, stop, kind
-
-        basename = os.path.basename(filename)
-        with open(os.path.splitext(filename)[0]+'-detected.csv', 'w') as fid:
-            csvwriter = csv.writer(fid)
-            for e in events:
-                csvwriter.writerow([basename, e[0], e[1], 'detected', e[2]])
+extension. See the minimal example in "src/detect-plugin.py" for a template.
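For reference, here is a pared-down, self-contained sketch of the interface such a plugin exposes. The event-finding logic is a placeholder amplitude threshold and the names (`find_events`, the "loud" label, the synthetic `song`) are illustrative, not part of SongExplorer; a real plugin would read the samples from the WAV file and write the CSV to disk.

```python
import csv
import io

# a list of lists specifying the detect-specific hyperparameters in the GUI
detect_parameters = [
    ["my-simple-textbox", "h-parameter 1", "", "32", [], None, True],
    ]

# a function which returns the strings used to annotate the detected events
def detect_labels(audio_nchannels):
    return ["loud"]

def find_events(song, threshold):
    # placeholder logic: contiguous runs of samples at or above `threshold`
    events, start = [], None
    for tic, sample in enumerate(song):
        if abs(sample) >= threshold and start is None:
            start = tic
        elif abs(sample) < threshold and start is not None:
            events.append((start, tic, "loud"))
            start = None
    if start is not None:
        events.append((start, len(song), "loud"))
    return events

song = [0, 0, 40, 50, 0, 0, 60, 0]          # stand-in for WAV samples
events = find_events(song, threshold=32)
buffer = io.StringIO()                       # stands in for "-detected.csv"
csvwriter = csv.writer(buffer)
for e in events:
    # columns: file, start tic, stop tic, kind, label
    csvwriter.writerow(["recording.wav", e[0], e[1], "detected", e[2]])
print(events)  # [(2, 4, 'loud'), (6, 7, 'loud')]
```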
 
 ## Double-Click Annotations ##
 
@@ -1829,54 +1791,29 @@ The default network architecture is a set of layered convolutions, the depth
 and width of which can be configured as described above. Should this not prove
 flexible enough, SongExplorer is designed with a means to supply your own
 TensorFlow code that implements a whiz bang architecture of any arbitrary
-design. See the minimal example in "src/architecture-plugin.py" for a
-template of how this works, a pared down version of which is as follows:
-
-    import tensorflow as tf
-
-    # a list of lists specifying the architecture-specific hyperparameters in the GUI
-    model_parameters = [
-        # each hyperparameter is described by a list with these entries:
-        # [ key in `model_settings`,
-        #   title in GUI,
-        #   "" for textbox or [] for pull-down,
-        #   default value,
-        #   enable logic,
-        #   callback,
-        #   required ]
-        ]
-
-    # a function which returns a keras model
-    def create_model(model_settings):
-        # `model_settings` is a superset of the hyperparameters above. see src/models.py
-
-        # hidden_layers is used to visualize intermediate clusters in the GUI
-        hidden_layers = []
-
-        # 'parallelize' specifies the number of output tics to classify simultaneously
-        ninput_tics = model_settings["context_tics"] + model_settings["parallelize"] - 1
-        input_layer = Input(shape=(ninput_tics, model_settings["nchannels"]))
-
-        # add custom layers here, e.g. x = Conv1D()(x)
-        # append interesting ones to hidden_layers
-
-        # last layer must be convolutional with nlabels as the output size
-        output_layer = Conv1D(model_settings['nlabels'], 1)(x)
-
-        return tf.keras.Model(inputs=input_layer, outputs=[hidden_layers, output_layer])
-
-In brief, two objects must be supplied in a python file: (1) a list named
-`model_parameters` which defines the variable names, titles, and default
-values, etc. to appear in the GUI, and (2) a function `create_model` which
-builds and returns the network graph. Specify as the `architecture_plugin` in
-"configuration.py" the full path to this file, without the ".py" extension.
-The buttons immediately above the configuration textbox in the GUI will
-change to reflect the different hyperparameters used by this architecture.
-All the workflows described above (detecting sounds, making predictions, fixing
-mistakes, etc) can be used with this custom network in an identical manner.
-The default convolutional architecture is itself written as a plug-in, and
-can be found in "src/convolutional.py".
-
+design. See the minimal example in "src/architecture-plugin.py" for a template
+of how this works. In brief, two objects must be supplied in a python file:
+(1) a list named `model_parameters` which defines the variable names, titles,
+and default values, etc. to appear in the GUI, and (2) a function
+`create_model` which builds and returns the network graph. Specify as the
+`architecture_plugin` in "configuration.py" the full path to this file, without
+the ".py" extension. The buttons immediately above the configuration textbox
+in the GUI will change to reflect the different hyperparameters used by this
+architecture. All the workflows described above (detecting sounds, making
+predictions, fixing mistakes, etc) can be used with this custom network in an
+identical manner. The default convolutional architecture is itself written as
+a plug-in, and can be found in "src/convolutional.py".
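A pared-down sketch of those two objects follows. The hyperparameter name "nfilters" is invented for illustration, and the TensorFlow import is deferred inside `create_model` so the fragment loads even where TensorFlow is not installed; consult "src/architecture-plugin.py" and "src/models.py" for the authoritative interface.

```python
# a list of lists specifying the architecture-specific hyperparameters in
# the GUI; each entry is [key in `model_settings`, title in GUI,
# "" for textbox or [] for pull-down, default value, enable logic,
# callback, required].  "nfilters" is an invented example.
model_parameters = [
    ["nfilters", "# filters", "", "32", [], None, True],
    ]

# a function which returns a keras model
def create_model(model_settings):
    # deferred import so this sketch loads without TensorFlow installed
    import tensorflow as tf
    from tensorflow.keras.layers import Input, Conv1D

    # hidden_layers is used to visualize intermediate clusters in the GUI
    hidden_layers = []

    # 'parallelize' specifies the number of output tics to classify simultaneously
    ninput_tics = model_settings["context_tics"] + model_settings["parallelize"] - 1
    input_layer = Input(shape=(ninput_tics, model_settings["nchannels"]))

    x = Conv1D(int(model_settings["nfilters"]), 3)(input_layer)
    hidden_layers.append(x)

    # last layer must be convolutional with nlabels as the output size
    output_layer = Conv1D(model_settings["nlabels"], 1)(x)

    return tf.keras.Model(inputs=input_layer, outputs=[hidden_layers, output_layer])

print(model_parameters[0][0])  # nfilters
```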
+
+## Clustering Algorithm ##
+
+The method used to reduce the dimensionality of the activations for
+visualization is also a plugin. By default, the UMAP algorithm is used, but
+plugins for t-SNE ([van der Maaten and Hinton
+(2008)](http://www.jmlr.org/papers/v9/vandermaaten08a.html)) and PCA are also
+included. To use these alternatives, change "cluster_plugin" in
+"configuration.py" to "tSNE" or "PCA", respectively. To create your own
+plugin, write a script which defines a list called `cluster_parameters`,
+inputs "activations.npz", and outputs "cluster.npz". See
+"src/cluster-plugin.py" for a template.
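A pared-down sketch of that contract, modeled on the PCA plugin added in this commit (src/PCA.py, where `cluster_parameters` is a function returning the list): the trivial column slice below is only a stand-in for a real reduction algorithm such as UMAP.

```python
import numpy as np

# the cluster-specific hyperparameters shown in the GUI
def cluster_parameters():
    return [
        ["ndims", "# dims", ["2","3"], "2", 1, [], None, True],
        ]

# reduce one layer's flattened activations to `ndims` dimensions;
# the column slice here is a placeholder for a real algorithm
def do_cluster(activations_flattened, ilayer, layers, parameters):
    if ilayer not in layers:
        return None, None
    ndims = int(parameters["ndims"])
    reduced = np.asarray(activations_flattened[ilayer])[:, :ndims]
    return None, reduced   # (fitted model, reduced coordinates)

acts = [np.arange(12.0).reshape(3, 4)]   # one layer, 3 sounds, 4 features
_, reduced = do_cluster(acts, 0, [0], {"ndims": "2"})
print(reduced.shape)  # (3, 2)
```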
 
 # Troubleshooting #
 
configuration.py

Lines changed: 2 additions & 1 deletion
@@ -60,7 +60,7 @@
 gui_context_spectrogram_height_pix=150
 gui_context_probability_height_pix=75
 gui_context_undo_proximity_pix=3
-gui_context_doubleclick_plugin="point"
+gui_context_doubleclick_plugin="point" # or snap-to
 gui_spectrogram_colormap="Viridis256"
 gui_spectrogram_window="hann"
 gui_spectrogram_length_sec=0.010

@@ -126,6 +126,7 @@
 cluster_ngpu_cards=-1
 cluster_ngigabytes_memory=-1
 cluster_cluster_flags=""
+cluster_plugin="UMAP" # or tSNE, PCA
 
 accuracy_where=default_where
 accuracy_ncpu_cores=-1

src/PCA.py

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+
+# reduce dimensionality of internal activation states with PCA
+
+# e.g. PCA.py \
+#      --data_dir=`pwd`/groundtruth-data \
+#      --layers=0,1,2,3,4 \
+#      --pca_batch_size=5 \
+#      --parallelize=0 \
+#      --parameters='{"ndims":2}'
+
+import argparse
+import os
+import numpy as np
+import sys
+from sklearn.decomposition import PCA, IncrementalPCA
+from natsort import natsorted
+from datetime import datetime
+import socket
+from itertools import repeat
+
+import json
+
+def cluster_parameters():
+    return [
+        ["ndims", "# dims", ["2","3"], "2", 1, [], None, True],
+        ]
+
+def do_cluster(activations_flattened, ilayer, layers, parameters):
+    if ilayer in layers:
+        print("reducing dimensionality of layer "+str(ilayer)+" with PCA...")
+        mu = np.mean(activations_flattened[ilayer], axis=0)
+        sigma = np.std(activations_flattened[ilayer], axis=0)
+        activations_scaled = (activations_flattened[ilayer] - mu) / sigma
+        if FLAGS.pca_batch_size==0:
+            pca = PCA()
+        else:
+            nfeatures = np.shape(activations_scaled)[1]
+            pca = IncrementalPCA(batch_size = FLAGS.pca_batch_size * nfeatures)
+        fit = pca.fit(activations_scaled)
+        return fit, fit.transform(activations_scaled)[:,0:int(parameters["ndims"])]
+    else:
+        return None, None
+
+FLAGS = None
+
+def main():
+    flags = vars(FLAGS)
+    for key in sorted(flags.keys()):
+        print('%s = %s' % (key, flags[key]))
+
+    layers = [int(x) for x in FLAGS.layers.split(',')]
+
+    print("loading data...")
+    activations=[]
+    npzfile = np.load(os.path.join(FLAGS.data_dir, 'activations.npz'),
+                      allow_pickle=True)
+    sounds = npzfile['sounds']
+    for arr_ in natsorted(filter(lambda x: x.startswith('arr_'), npzfile.files)):
+        activations.append(npzfile[arr_])
+
+    nlayers = len(activations)
+
+    kinds = set([x['kind'] for x in sounds])
+    labels = set([x['label'] for x in sounds])
+    print('label counts')
+    for kind in kinds:
+        print(kind)
+        for label in labels:
+            count = sum([label==x['label'] and kind==x['kind'] for x in sounds])
+            print(count,label)
+
+    activations_flattened = [None]*nlayers
+    for ilayer in layers:
+        nsounds = np.shape(activations[ilayer])[0]
+        activations_flattened[ilayer] = np.reshape(activations[ilayer],(nsounds,-1))
+        print("shape of layer "+str(ilayer)+" is "+str(np.shape(activations_flattened[ilayer])))
+
+    fits_pca = [None]*nlayers
+    activations_scaled = [None]*nlayers
+
+    if FLAGS.parallelize!=0:
+        from multiprocessing import Pool
+        nprocs = os.cpu_count() if FLAGS.parallelize==-1 else FLAGS.parallelize
+        with Pool(min(nprocs,nlayers)) as p:
+            fits, activations_clustered = zip(*p.starmap(do_cluster,
+                                                         zip(repeat(activations_flattened),
+                                                             range(len(activations_flattened)),
+                                                             repeat(layers),
+                                                             repeat(FLAGS.parameters))))
+    else:
+        fits = [None]*nlayers
+        activations_clustered = [None]*nlayers
+        for ilayer in layers:
+            print("reducing dimensionality of layer "+str(ilayer)+" with PCA...")
+            fits[ilayer], activations_clustered[ilayer] = do_cluster(activations_flattened,
+                                                                     ilayer,
+                                                                     layers,
+                                                                     FLAGS.parameters)
+
+    import matplotlib as mpl
+    mpl.use('Agg')
+    import matplotlib.pyplot as plt
+    #plt.ion()
+
+    fig = plt.figure()
+    ax = fig.add_subplot(111)
+    for ilayer in layers:
+        cumsum = np.cumsum(fits[ilayer].explained_variance_ratio_)
+        line, = ax.plot(cumsum)
+        line.set_label('layer '+str(ilayer))
+
+    ax.set_ylabel('cumsum explained variance')
+    ax.set_xlabel('# of components')
+    ax.legend(loc='lower right')
+    plt.savefig(os.path.join(FLAGS.data_dir, 'cluster.pdf'))
+
+    np.savez(os.path.join(FLAGS.data_dir, 'cluster'), \
+             sounds = sounds,
+             activations_clustered = np.array(activations_clustered, dtype=object),
+             fits = np.array(fits, dtype=object) if FLAGS.save_fits else None,
+             labels_touse = npzfile['labels_touse'],
+             kinds_touse = npzfile['kinds_touse'])
+
+def str2bool(v):
+    if v.lower() in ('yes', 'true', 't', 'y', '1'):
+        return True
+    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
+        return False
+    else:
+        raise argparse.ArgumentTypeError('Boolean value expected.')
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        '--data_dir',
+        type=str)
+    parser.add_argument(
+        '--layers',
+        type=str)
+    parser.add_argument(
+        '--pca_batch_size',
+        type=int)
+    parser.add_argument(
+        '--parallelize',
+        type=int,
+        default=0)
+    parser.add_argument(
+        '--parameters',
+        type=json.loads,
+        default='{"neighbors": 10, "distance": 0.1}')
+    parser.add_argument(
+        '--save_fits',
+        type=str2bool,
+        default=False,
+        help='Whether to save the cluster models')
+
+    FLAGS, unparsed = parser.parse_known_args()
+
+    print(str(datetime.now())+": start time")
+    repodir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
+    with open(os.path.join(repodir, "VERSION.txt"), 'r') as fid:
+        print('SongExplorer version = '+fid.read().strip().replace('\n',', '))
+    print("hostname = "+socket.gethostname())
+
+    try:
+        main()
+
+    except Exception as e:
+        print(e)
+
+    finally:
+        if hasattr(os, 'sync'):
+            os.sync()
+        print(str(datetime.now())+": finish time")
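The scree curve that PCA.py saves to "cluster.pdf" plots the cumulative explained-variance ratio per layer. A numpy-only sketch of that quantity (synthetic data, no sklearn; `explained_variance_ratio` is a hypothetical stand-in for the attribute `PCA().explained_variance_ratio_`):

```python
import numpy as np

def explained_variance_ratio(X):
    # eigenvalues of the covariance matrix, largest first, normalized to
    # sum to one -- the quantity sklearn exposes as
    # PCA().explained_variance_ratio_
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # synthetic flattened activations
ratios = explained_variance_ratio(X)
cumsum = np.cumsum(ratios)
# the curve is nondecreasing and reaches 1.0 at the last component
print(round(float(cumsum[-1]), 6))     # 1.0
```

A steep early rise in this curve is what justifies setting a small `PCA fraction` when pre-reducing before UMAP.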
