The table below describes the Clustering pipeline steps that are available in FCS Express. If you would like to recommend additional Clustering methods to be provided with FCS Express, please contact support@denovosoftware.com.

 

 

Clustering

Batch Self-Organizing Map

The Batch Self-Organizing Map step performs Batch SOM clustering on the input population, using the parameters selected in the Parameter Options list (see image below). Parameters can be filtered or sorted to assist in their selection when multiple parameters are available in the template file. Using the Filter: field, a user can remove unwanted parameters from view to simplify selection. For example, if a user only wanted to select area parameters, typing "-A" in the field would reduce the number of parameters shown in the parameter list. By right-clicking in the parameters section, you can use Sort Ascending, Sort Descending, or Unsorted to manage parameter ordering and facilitate parameter selection. The right-click menu also lets you select parameters with Check All, Uncheck All, Check Selected, Uncheck Selected, and Invert Selection on All. The Check Selected and Uncheck Selected options support the Shift and Ctrl keys to multi-select parameters and check or uncheck them all simultaneously.

 

The output of the Batch SOM (Batch Self-Organizing Map) step is a list of Cluster Assignments: an internal label indicating SOM node membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The output parameter of the Batch SOM pipeline step can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

Once the Batch SOM clustering is displayed on the Plate Heat Map, the following actions can be performed on the Plate Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

Some background on the Batch Self-Organizing Map (Batch SOM) algorithm can be found below.

 

Batch SOM is a variation of the original SOM algorithm (please refer to the SOM section for more details on Self-Organizing Maps) with the following advantages:

In the Batch SOM there is no Learning Rate function, so the number of required parameters is reduced.

The Batch SOM algorithm converges faster than the stepwise "gradient" method.

The Batch SOM algorithm is faster than the SOM algorithm, since it can run on all the cores of the computer.

 

The main difference between the classical SOM and the Batch SOM is in the way the 2D grid is trained. With the classical SOM, the 2D grid is trained and updated one data point at a time (i.e. one data point per iteration), while in the Batch SOM, each iteration presents the entire dataset and then updates the 2D grid once. With every iteration, the Winner node (also called Best Matching Unit, or "BMU") is calculated for every data point of the dataset, not just for a single randomly-selected data point.

 

That means that in every iteration and for every node, the set of data points that have that node as their BMU is known.

With the Neighborhood function (please refer to the SOM section for more details), the set of data points referring to each neighborhood can also be defined (i.e. all the data points whose BMU is a node of the neighborhood).

In the Weight Updating phase, the location of each SOM node in the high-dimensional space is updated so that each node moves towards the center of all the points referring to its neighborhood. The new location is calculated as the cumulative contribution of each data point, weighted by the neighborhood distance between the node of interest and that point's BMU.
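
As an illustration of this update, a minimal numpy sketch of a single Batch SOM iteration is shown below. This is an outside illustration rather than FCS Express code; the array layout and the Gaussian neighborhood are assumptions made for the example.

    import numpy as np

    def batch_som_update(data, weights, grid_xy, sigma):
        # data: (n_points, n_dims) events; weights: (n_nodes, n_dims) node locations
        # grid_xy: (n_nodes, 2) fixed 2D grid coordinates; sigma: current neighborhood spread
        dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
        bmu = dists.argmin(axis=1)                                 # BMU index for every event
        grid_d2 = ((grid_xy[bmu][:, None, :] - grid_xy[None, :, :]) ** 2).sum(axis=2)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))                  # neighborhood weight per (event, node)
        # Each node moves to the neighborhood-weighted mean of all events.
        return (h.T @ data) / np.maximum(h.sum(axis=0)[:, None], 1e-12)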

 

The training of the 2D Grid is performed by repeating the procedure above several times in two subsequent training phases.

 

The two training phases are:

 

1. Coarse training phase. During this phase, the neighborhood spread decreases (see the Training Decay Function option below) from an initial value to a final value. The number of cycles, and both the Initial and Final Neighborhood Spread values for this phase, can be defined by the user via the Coarse Training Options fields (see image below). Note that by default the Initial and Final Neighborhood Spread values are automatically calculated.

2. Fine-Tune training phase. During this phase, the neighborhood spread decreases (see the Training Decay Function option below) from an initial value to a final value. The initial value of the neighborhood spread cannot exceed the final value of the neighborhood spread of the coarse phase (as a recommendation, the default initial value of the neighborhood spread for this phase should match the final value of the neighborhood spread of the coarse phase). Also, the final value of the neighborhood spread in this phase cannot be less than half of its initial value in this phase (as a recommendation, the neighborhood spread in this phase should be held constant). The number of cycles, and both the Initial and Final Neighborhood Spread values for this phase, can be defined by the user via the Fine-Tune Training Options fields (see image below). Note that by default the Initial and Final Neighborhood Spread values are automatically calculated.

 

pipeline_BatchSOM

 

The output of the Batch SOM step is a list of Cluster Assignments: an internal label indicating SOM node membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The high-dimensional locations of the nodes can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

The name of the new parameter created by the Batch SOM algorithm can also be customized by the user via the "New Parameter Name" field.

 

The Batch SOM algorithm is stochastic. To make it reproducible, a fixed Seed may be set. If the same dataset and settings are used with the same Random Seed value, the same result will be achieved. Batch SOM may be run multiple times with different Random Seeds to evaluate the stability and consistency of the result.

 

Once the Batch SOM clustering is displayed on the Plate Heat Map, the following actions can be performed on the Plate Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis. Meta-clusters may also be automatically gated using Heat Map Well Gates.

 

 

Louvain Communities

 

The Louvain method for community detection extracts communities from large networks; it was introduced by Blondel et al. at the Université catholique de Louvain in 2008.

 

The input dataset is represented by a weighted network of data points (i.e. cells), in which each data point is connected to its closest neighbors by weighted edges. In this representation, communities (i.e. phenotypically-similar cells) are expected to manifest as groups of highly-interconnected nodes. In the Louvain method, communities are defined by maximizing the local modularity, a measure of the relative density of connections inside communities with respect to connections outside communities.

 

The process is formally executed in 3 different steps:

 

1) Local move phase

Each node in the network is first assigned to its own community. Then, given a node i of the network, the change in modularity is calculated for removing i from its own community and moving it into the community of each neighbor j of i. Once all the neighbors j have been tested, node i is moved to the community that ensures the largest increase in modularity. If none of the moves increases modularity, node i stays in its current community.

To speed up the calculation, the process described above is not repeated for each node of the network, but just for a subset of them. The list of revisited nodes is “pruned” from the entire network down to a much smaller group of nodes that have a high potential to actually swap communities. This approach is called “Louvain pruning”.

The Local move phase stops when it is no longer possible to increase the local modularity by performing a node move. 

 

2) Aggregation phase

Each community found in Step 1 is grouped into one super-node to create a weighted network of super-nodes.

 

3) Repeat

The process restarts, with Step 1 being run on the network of super-nodes generated in Step 2. The entire process repeats until no increase in local modularity is gained.

 

 

The Louvain algorithm is a greedy heuristic. Finding the community assignment that maximizes the overall modularity is computationally infeasible, so the Louvain method attempts to find the best "local" solution. It does this by randomly selecting a path through the network in Step 1 above. The random path impacts the final results because each time a node moves, the local modularity profile changes accordingly; this, in turn, influences its neighbors' decisions to merge. For this reason, the overall calculation (Steps 1-2-3) can be executed more than once, with a different random start each time. The number of times the whole process restarts is defined by the user with the Random Restarts option (see picture below); 5 restarts is usually a good value to try. The best result is then chosen using the final modularity score: the community assignment with the highest overall modularity is selected.

 

To make the process reproducible, a fixed Seed may be set. If the same dataset and settings are used with the same Random Seed value, the same result will be achieved. The Louvain Communities pipeline step may be run multiple times with different Random Seeds to evaluate the stability and consistency of the result.
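
FCS Express performs these steps internally. As an outside illustration of restarts and seeding (not FCS Express code), the sketch below uses the networkx implementation of the Louvain method on a toy graph; the karate-club graph stands in for the weighted cell network:

    import networkx as nx

    G = nx.karate_club_graph()  # toy stand-in for the weighted cell network

    best_score, best_communities = float("-inf"), None
    for seed in range(5):  # 5 random restarts, as suggested above
        communities = nx.community.louvain_communities(G, weight="weight", seed=seed)
        score = nx.community.modularity(G, communities, weight="weight")
        if score > best_score:
            best_score, best_communities = score, communities

    print(f"best modularity {best_score:.3f} with {len(best_communities)} communities")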

 

Other notes to consider:

Convergence of the Louvain algorithm is guaranteed. 

The first iteration of the Louvain algorithm (i.e. the first time Steps 1 and 2 are executed) takes the majority of the execution time. However, FCS Express vastly speeds up this part by using the “Louvain pruning” technique.

Even very large networks tend to converge in less than 10 overall iterations. This is because the network “shrinks” with each iteration, so there are fewer nodes to visit for moving.

 

pipeline_Louvain

 

The output of the Louvain Communities pipeline step is a new parameter, called Louvain Communities, defining the community assignment for each cell of the input dataset. The "Louvain Communities" parameter can be plotted on Histograms, 2D plots, and Plate Heat Maps. We suggest using Plate Heat Maps, since they allow the following actions:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

 

Consensus Clustering

 

The Consensus Clustering step performs Consensus Clustering on the input population, using the parameters selected in the Parameter Options list (see image below). Parameters can be filtered or sorted to assist in their selection when multiple parameters are available in the template file. Using the Filter: field, a user can remove unwanted parameters from view to simplify selection. For example, if a user only wanted to select area parameters, typing "-A" in the field would reduce the number of parameters shown in the parameter list. By right-clicking in the parameters section, you can use Sort Ascending, Sort Descending, or Unsorted to manage parameter ordering and facilitate parameter selection. The right-click menu also lets you select parameters with Check All, Uncheck All, Check Selected, Uncheck Selected, and Invert Selection on All. The Check Selected and Uncheck Selected options support the Shift and Ctrl keys to multi-select parameters and check or uncheck them all simultaneously.

 

Consensus clustering is a technique to find a single clustering result (i.e. a consensus) when many clustering runs are performed.

Consensus clustering in FCS Express is performed with the Monti method (Monti et al. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning. 52. 91-118. 10.1023/A:1023949509487). Briefly, subsets of points are sampled from the input dataset and clustering is performed on each subset. A pairwise consensus matrix is then created to record the frequency with which every pair of points is found in the same cluster. The consensus matrix (M) can then be converted into a distance/dissimilarity matrix (1-M) and used to perform a final clustering step, which gives the consensus clustering.
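
A minimal sketch of this resampling scheme is given below. It is an outside illustration rather than FCS Express code; the choice of k-Means for the per-subset clustering and of average linkage for the final step are assumptions made for the example.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform
    from sklearn.cluster import KMeans

    def consensus_cluster(X, k, n_samplings=100, frac=0.8, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        together = np.zeros((n, n))   # times each pair landed in the same cluster
        sampled = np.zeros((n, n))    # times each pair was sampled together
        for _ in range(n_samplings):
            idx = rng.choice(n, size=int(frac * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=int(rng.integers(2**31))).fit_predict(X[idx])
            sampled[np.ix_(idx, idx)] += 1
            for c in range(k):
                members = idx[labels == c]
                together[np.ix_(members, members)] += 1
        M = together / np.maximum(sampled, 1)           # consensus matrix M
        np.fill_diagonal(M, 1.0)
        dist = squareform(1.0 - M, checks=False)        # 1 - M as dissimilarity
        return fcluster(linkage(dist, method="average"), t=k, criterion="maxclust")

Calling consensus_cluster(X, k=10) returns a consensus cluster assignment for each row of X.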

 

This clustering approach is computationally heavy, so the maximum number of objects that can be clustered is set to 8000. The Consensus Clustering step is therefore well suited to clustering the clusters generated by other clustering steps (whose number is usually lower than 8000). This procedure is also called Meta-clustering.

 

The output of the Consensus Clustering step is a list of Cluster Assignments: an internal label indicating cluster membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The Cluster Assignment can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

When the input parameter is a cluster assignment (a common occurrence with this type of clustering; see comment above), the current clustering step performs a meta-clustering. In this case, the following options are particularly useful:

Add Meta-Clustering as New Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step is added as a new parameter and can thus be displayed on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

Add Meta-Clustering to Selected Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step can be visualized as a colored border around nodes when the input parameter is displayed on a Heat Map plot.

 

The following settings may also be customized by the user:

Number of Clusters. This defines the number of clusters which should be found.

Sampling Percentage. This defines the fraction of events sampled at every repetition.

Number of Samplings. This defines the number of repetitions.

New Parameter Name. The name of the new parameter (i.e. the one containing the cluster assignment for each input data) can be specified in this field.

Clustering Algorithm. This option allows a user to define whether the consensus clustering should be performed using k-Means or Hierarchical clustering.

Generate New Random Seed. Setting a Seed ensures reproducibility with algorithms that rely on random initialization. Changing the Seed is a good way to evaluate the impact of randomness on the clustering result.

 

pipeline_consensusclustering

 

Once the Consensus clustering is displayed on the Heat Map, the following actions can be performed on the Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

 

Hierarchical Clustering

 

The Hierarchical Clustering step performs agglomerative (hierarchical) clustering on the input population, using the parameter(s) selected in the Parameter Options list (see image below). Parameters can be filtered or sorted to assist in their selection when multiple parameters are available in the template file. Using the Filter: field, a user can remove unwanted parameters from view to simplify selection. For example, if a user only wanted to select area parameters, typing "-A" in the field would reduce the number of parameters shown in the parameter list. By right-clicking in the parameters section, you can use Sort Ascending, Sort Descending, or Unsorted to manage parameter ordering and facilitate parameter selection. The right-click menu also lets you select parameters with Check All, Uncheck All, Check Selected, Uncheck Selected, and Invert Selection on All. The Check Selected and Uncheck Selected options support the Shift and Ctrl keys to multi-select parameters and check or uncheck them all simultaneously.

 

This clustering approach is computationally heavy, so the maximum number of objects that can be clustered is set to 8000. The Hierarchical Clustering step is therefore well suited to further clustering the clusters generated by other clustering steps (whose number is usually lower than 8000). This procedure is also called Meta-clustering.

 

The output of the Hierarchical Clustering step is a list of Cluster Assignments: an internal label indicating cluster membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The Cluster Assignment can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

When the input parameter is a cluster assignment (a common occurrence with this type of clustering), the current clustering step performs a meta-clustering. In this case, the following options are particularly useful:

Add Meta-Clustering as New Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step is added as a new parameter and can thus be displayed on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

Add Meta-Clustering to Selected Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step can be visualized as a colored border around nodes when the input parameter is displayed on a Heat Map plot.

 

The following settings may also be customized by the user:

 

Number of Clusters. Defines the number of clusters to attempt to find.

Linkage Method. Possible options for linkage are listed below (a minimal example follows this list):

Single. The distance between two clusters is defined by the smallest distance between a point in the first cluster and a point in the second cluster. Also referred to as the "nearest-neighbor" method.

Complete. The distance between two clusters is defined by the largest distance between a point in the first cluster and a point in the second cluster. Also referred to as the "furthest-neighbor" method.

Average. The distance between two clusters is defined by the average distance between the points in the first cluster and the points in the second cluster.

Ward. The Ward method (also called the minimum-variance method) minimizes the within-cluster variance. Two clusters are merged if the variance of the resulting cluster is the smallest possible compared to the variance resulting from merging any other pair of clusters. The Ward method aims at finding compact, spherical clusters.
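
As referenced above, the SciPy sketch below shows the four linkage options on stand-in data; it is an outside illustration, not FCS Express code:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))   # stand-in for e.g. cluster centroids (< 8000 objects)

    # method can be "single", "complete", "average", or "ward",
    # matching the Linkage Method options described above.
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=10, criterion="maxclust")   # Number of Clusters = 10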

 

 

pipeline_hierarchical

 

Once the Hierarchical clustering is displayed on the Heat Map, the following actions can be performed on the Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

 

Kmeans

 

The Kmeans step performs k-Means clustering on the input population, using the parameter(s) selected in the Parameter Options list (see image below). Parameters can be filtered or sorted to assist in their selection when multiple parameters are available in the template file. Using the Filter: field, a user can remove unwanted parameters from view to simplify selection. For example, if a user only wanted to select area parameters, typing "-A" in the field would reduce the number of parameters shown in the parameter list. By right-clicking in the parameters section, you can use Sort Ascending, Sort Descending, or Unsorted to manage parameter ordering and facilitate parameter selection. The right-click menu also lets you select parameters with Check All, Uncheck All, Check Selected, Uncheck Selected, and Invert Selection on All. The Check Selected and Uncheck Selected options support the Shift and Ctrl keys to multi-select parameters and check or uncheck them all simultaneously.

 

The following settings may also be customized by the user (please refer to the Defining a k-Means cluster analysis chapter for more details):

Number of Clusters

Min percent Cells Per Cluster

Maximum Iterations

Number of Attempts

Keep-Non-Converged Data

 

The output of the Kmeans step is a list of Cluster Assignments: an internal label indicating cluster membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

The Cluster Assignment can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

When the input parameter is a cluster assignment, the current clustering step performs a meta-clustering. In this case, the following options are particularly useful:

Add Meta-Clustering as New Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step is added as a new parameter and can thus be displayed on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

Add Meta-Clustering to Selected Parameter. By checking this check box, the meta-cluster assignment generated by this clustering step can be visualized as a colored border around nodes when the input parameter is displayed on a Heat Map plot.

 

Finally, the Kmeans algorithm is stochastic. To make it reproducible, a fixed Seed may be set. If the same dataset and settings are used with the same Random Seed value, the same result will be achieved. Kmeans may be run multiple times with different Random Seeds to evaluate the stability and consistency of the result.
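
As an outside illustration of this point (not FCS Express code), the scikit-learn sketch below fixes the random seed so that two runs on the same data produce identical assignments; mapping n_init to Number of Attempts and max_iter to Maximum Iterations is an assumption made for the example:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=5000, centers=8, random_state=1)

    # Same data + same seed -> identical cluster assignments on repeated runs.
    kwargs = dict(n_clusters=8, n_init=10, max_iter=300, random_state=42)
    labels_a = KMeans(**kwargs).fit_predict(X)
    labels_b = KMeans(**kwargs).fit_predict(X)
    assert (labels_a == labels_b).all()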

 

pipeline_Kmeans

 

Once the k-Means clustering is displayed on the Heat Map, the following actions can be performed on the Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

 

Self-Organizing Map

 

This step performs SOM clustering on the input population, using the parameters selected in the Parameter Options list (see image below). Parameters can be filtered or sorted to assist in their selection when multiple parameters are available in the template file. Using the Filter: field, a user can remove unwanted parameters from view to simplify selection. For example, if a user only wanted to select area parameters, typing "-A" in the field would reduce the number of parameters shown in the parameter list. By right-clicking in the parameters section, you can use Sort Ascending, Sort Descending, or Unsorted to manage parameter ordering and facilitate parameter selection. The right-click menu also lets you select parameters with Check All, Uncheck All, Check Selected, Uncheck Selected, and Invert Selection on All. The Check Selected and Uncheck Selected options support the Shift and Ctrl keys to multi-select parameters and check or uncheck them all simultaneously.

 

The output of the SOM (Self-Organizing Map) step is a list of Cluster Assignments: an internal label indicating SOM node membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The output parameter of the SOM pipeline step can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

Once the SOM clustering is displayed on the Plate Heat Map, the following actions can be performed on the Plate Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis.

 

 

Some background on the Self-Organizing Map (SOM) algorithm can be found below, together with a description of all the settings available in FCS Express to customize the SOM calculations.

 

The Self-Organizing Map (SOM) is a special class of artificial neural network, first introduced by Teuvo Kohonen, and is used extensively as a clustering, dimensionality-reduction, and visualization tool in exploratory data analysis.

The input and the output data are defined in the same high-dimensional space, but the number of output data points k (i.e. "nodes" or "neurons") is smaller than the number of input data points n. For the sake of simplicity, output nodes can be thought of as centroids in k-Means clustering, and, much like in k-Means, output nodes for SOM are generally initialized randomly at the start of the SOM process.

 

One of the key points of SOM is that output nodes also exist in a bi-dimensional space, forming a grid in which each node is connected to its neighboring nodes. The size of the grid is defined by the user (see the Grid Width and Grid Height options in the image below; the default setting for a standalone SOM step is a 7x7 grid, but when the SOM step is added via the FlowSOM pre-defined algorithm, the default setting is a 10x10 grid).

 

The basic steps of Kohonen’s SOM algorithm can be summarized by the following iterative procedure:

1. Initialization. The location in the high-dimensional space of each of the k output nodes is initialized by picking k random points. They can be k random locations of the space, generally chosen in a small ‘ball’, or k random points from the input dataset (see the Cluster Centroid Initialization Method in the table below).

2. Sampling. Randomly select a point from the input dataset to present to the SOM nodes.

3. Similarity Matching. Compute the Euclidean distances between the selected point and each SOM node to find the nearest node (i.e. the winner; also called Best Matching Unit, or BMU).

4. Weight Updating. The location of all SOM nodes in the high-dimensional space is updated so that nodes move towards the point sampled in Step 2. Each node moves by a fraction of its distance to the current sample point. That fraction is calculated for each node by a Learning function, which depends on both the node's 2D distance to the winner node and a learning rate (see below). The Learning function is a decreasing function which decreases at every iteration t (see Step 5).

5. Repeat Steps 2, 3 and 4 until training is complete.
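
A minimal numpy sketch of this loop is shown below. It is an outside illustration of Steps 1 through 5, not FCS Express code; the linear decay and the Gaussian neighborhood are choices made for the example:

    import numpy as np

    def train_som(data, width=7, height=7, n_iter=1000, alpha0=0.5, sigma0=3.0, seed=6):
        rng = np.random.default_rng(seed)
        nodes = width * height
        grid = np.array([(x, y) for y in range(height) for x in range(width)], float)
        # Step 1: initialize node locations from random input points (needs len(data) >= nodes).
        w = data[rng.choice(len(data), nodes, replace=False)].astype(float)

        for t in range(n_iter):
            frac = t / n_iter
            alpha = alpha0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)        # decaying neighborhood spread
            x = data[rng.integers(len(data))]              # Step 2: sampling
            bmu = np.linalg.norm(w - x, axis=1).argmin()   # Step 3: similarity matching
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)     # squared 2D grid distances
            h = np.exp(-d2 / (2.0 * sigma ** 2))           # Gaussian neighborhood
            w += alpha * h[:, None] * (x - w)              # Step 4: weight updating
        return w                                           # Step 5: repeat until done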

 

The Learning function in Step 4 above defines the amplitude of the movement of each node of the 2D grid towards the sample point (in other words, it defines how the 2D grid "learns" the shape of the multidimensional dataset). A generalized formula for this function is the following:

 

Learning = DistHD * α(t) * h(t)

 

Where DistHD is the distance, calculated in the high-dimensional space, between the sample point and the node, α(t) is the Learning Rate function, and h(t) is the Neighborhood function. Both the Learning Rate and the Neighborhood functions range between 0 and 1 and are decreasing functions (i.e. they both decrease at every iteration t). Since the Neighborhood function is always equal to 1 for the winner node (see below), the Learning Rate function can be viewed as the fraction of the distance covered by the winner node towards the sampled point. For every other node, the distance covered towards the sampled point will be equal to, or shorter than, that (because of the further penalty applied by h(t), which is based on the distance of the node to the winner node).

 

 

Parameter Options

SOM will be run on the population selected in the main pipeline body, using the parameter(s) selected in the Parameter Options list (see image below). Parameters can be selected individually by clicking on the boxes or by using the right-click menu, which gives the user the option to Select All, Deselect All, or Invert Selection.

 

 

Transformation Options

 

Add Meta-Clustering...

The option allows users to control how the clustering information is accessed when the clustering is performed as Meta-Clustering (i.e. when clusters, and not single-cell events, are clustered). Two (non-mutually exclusive) options are available:

Add Meta-Clustering as New Parameter. By checking this check box, the meta-cluster assignment generated by the clustering step will be added as a new parameter and can thus be displayed on 1D plots, 2D plots, Plate Heat Maps, and Data Grids.

Add Meta-Clustering to Selected Parameter. By checking this check box, the meta-cluster assignment generated by the clustering step can be visualized as a colored border around nodes when the input parameter is displayed on a Plate Heat Map plot.

 

Meta-clusters may also be automatically gated using Heat Map Well Gates.

Training Set Size

Defines how many events from the dataset will be sampled and thus how many iterations will be run. Possible choices are:

When Absolute is selected, the defined number of samplings will be performed (default is 1000). A number higher than the total number of input events can also be set.

When Percent (default) is selected, the number of samplings is defined as a percentage of the total number of input events (default is 100 when SOM is inserted into a pipeline as a single step, while it is 1000 when SOM is inserted as part of the FlowSOM Pre-Defined Algorithm step). Values higher than 100 are allowed; e.g. 1000 means that the number of samplings will be 10 times the total number of events in the input dataset.

SOM 2D Grid

Defines the size of the 2D grid (see description above).

Note: In FCS Express the 2D grid is always rectangular, which means that every node has at most 4 direct neighbors.

Learning Rate

This user-defined parameter (also called α; alpha) is a main parameter controlling the learning of the SOM, since it scales the contribution of new information. Practically, it can be viewed as the fraction of the distance covered by the winner node towards the sampled point.

The Learning Rate ranges between 0 and 1, with both the Initial and Final values being defined by the user.

The Learning Rate decreases with time (i.e. at every iteration t) following a Decay Function (see below).

Neighborhood Spread

This user-defined parameter (also called σ; sigma) defines the amplitude of the neighborhood around the winner node in the 2D grid. The Neighborhood Spread is one of the variables of the Neighborhood Function (see below).

Both the initial and the final value can be defined by the user via the Neighborhood Spread fields (see image below), or automatically calculated by checking the Automatic Neighborhood Spread check box. The automatic calculation uses as the initial value the 67th percentile of the Cumulative Distribution Function of the entire set of fixed distances on the 2D map, and 0 as the final value. Distances can be calculated using either the Euclidean or the Chebyshev method (see 2D Grid Distance Metric below).

As with the Learning Rate, the Neighborhood Spread also decreases with time (i.e. at every iteration t) following a Decay Function (see below).

New Parameter Name

The name of the new parameter (i.e. the one containing the cluster assignment for each input data) can be specified in the field.

Cluster Centroid Initialization Method

Defines how the centroid initialization occurs.

When Random is selected, centroids will be initialized with random values between 0.1 and 1. The idea of this approach is to let the mapping evolve with the data.

When Random Cells is selected, centroids will be initialized with random points (i.e. cells) from the input dataset.

Training Decay Function

Both the Learning Rate α and the Neighborhood Spread σ values decrease with time following this function.

When Asymptotic is selected, both α and σ decrease proportionally to 1/t, where t is the current iteration.

When Linear is selected, both α and σ decrease linearly with t/T, where t is the current iteration and T is the total number of iterations to be performed.
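
The exact formulas used by FCS Express are not reproduced here; a sketch of the two behaviors, under assumed functional forms, could look like the following:

    def decay(initial, final, t, T, mode="Asymptotic"):
        if mode == "Asymptotic":
            return final + (initial - final) / (1.0 + t)   # ~1/t decay (assumed form)
        return initial + (final - initial) * (t / T)       # linear in t/T (assumed form)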

2D Grid Neighborhood Function

Allows the learning to vary for the different nodes of the 2D grid based on their distance to the winner node: the farther a node is from the winner, the less it learns, i.e. the lower the magnitude of its movement toward the sampled point.

The 2D Grid Neighborhood Function ( h(t) ) can be either a Gaussian or a Boxcar (also known as Bubble) function.

 

When Gaussian is selected the following formula will be used:

 

h(t) = exp( -Dist2D² / 2σ(t)² )

 

Where Dist2D is the distance on the 2D grid between the node and the winner node. σ(t) is the Neighborhood Spread, which starts at the value defined in the Initial Neighborhood Spread field (see above) and decreases following a Decay Function (see below) until reaching the value defined in the Final Neighborhood Spread field.

Looking carefully at the above formula, the user can see that, since Dist2D is 0 for the winner node, the value of h(t) for the winner node is always 1.

 

When Boxcar is selected the following formula will be used:

 

h(t) = 1 when Dist2D ≤ σ(t)

h(t) = 0 when Dist2D > σ(t)

 

Where Dist2D is still the distance on the 2D grid between the node and the winner node, and σ(t) is the Neighborhood Spread. When using Boxcar, every node within the neighborhood moves as the winner node does; every node outside the neighborhood does not move.
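
Both functions translate directly into code; the following Python sketch mirrors the two formulas above:

    import numpy as np

    def gaussian_h(dist_2d, sigma):
        return np.exp(-dist_2d ** 2 / (2.0 * sigma ** 2))  # equals 1 for the winner (dist 0)

    def boxcar_h(dist_2d, sigma):
        return (dist_2d <= sigma).astype(float)            # 1 inside the neighborhood, 0 outside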

2D Grid Distance Metric

Defines the metric used to calculate distances on the 2D grid.

Provided that two horizontally-adjacent nodes, or two vertically-adjacent nodes, are 1 unit apart, the distance between a node and every other node can be calculated as either the Euclidean or the Chebyshev distance. With the Euclidean distance every node has at most 4 direct neighbors, while with the Chebyshev distance every node (on a rectangular 2D grid) has at most 8 direct neighbors.
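
For example, two diagonally-adjacent nodes are sqrt(2) units apart under the Euclidean metric but only 1 unit apart under the Chebyshev metric, which is why the latter yields 8 direct neighbors:

    import numpy as np

    a, b = np.array([0, 0]), np.array([1, 1])   # two diagonally-adjacent grid nodes
    euclidean = np.linalg.norm(a - b)           # sqrt(2): not a direct neighbor
    chebyshev = np.abs(a - b).max()             # 1: a direct neighbor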

Generate New Random Seed

The Random Seed is a number used by the random number generator: the generator performs a series of mathematical operations on the seed to produce the required random values.

Since SOM relies on a random initialization step to place the nodes in the multi-dimensional space, a seed can be set to make results reproducible over time when SOM is applied to the same data. The seed can be changed either by clicking on the Change Random Seed button or by manually entering a seed in the field.

In FCS Express, the default Random Seed is 6.

 

 

 

pipeline_SOM

 

The output of the SOM step is a list of Cluster Assignments: an internal label indicating SOM node membership is automatically assigned to each event. The cluster assignment can be displayed by the user on 1D plots, 2D plots, Plate Heat Map plots, and Data Grids.

 

The high-dimensional locations of the nodes can be used as the input parameter for the Minimum Spanning Tree pipeline step.

 

The name of the new parameter created by the SOM algorithm can also be customized by the user via the "New Parameter Name" field.

 

The SOM algorithm is stochastic. To make it reproducible, a fixed Seed may be set. If the same dataset and settings are used with the same Random Seed value, the same result will be achieved. SOM may be run multiple times with different Random Seeds to evaluate the stability and consistency of the result.

 

Once the SOM clustering is displayed on the Plate Heat Map, the following actions can be performed on the Plate Heat Map:

Modify the well size

Have well size dependent on a statistic such as number of events in the node

Set the Parameter and Statistic for display

Change the Color Level and Color Scheme

Create Well gates to select one or more clusters (i.e. wells) and use those gates/clusters for downstream analysis. Meta-clusters may also be automatically gated using Heat Map Well Gates.