
Testing python sigclust (relationship between full cluster & damaging clusters)
Closed, ResolvedPublic

Event Timeline

aetilley claimed this task.
aetilley raised the priority of this task from to Needs Triage.
aetilley updated the task description.
aetilley subscribed.

@aetilley, I got an update from you in IRC yesterday. Can you add some notes here too? We can probably mark this done then.

  1. See the recently added script R_read.R in sigclust/enwiki_data (see https://github.com/aetilley/sigclust). Call source("R_read.R") in R (inside the "enwiki_data" directory) to apply sigclust to the enwiki data (now titled "data2.tsv") as well as some other artificial data.
  2. I am getting the following warning: http://stackoverflow.com/questions/21382681/kmeans-quick-transfer-stage-steps-exceeded-maximum
  3. Lastly, from a recent email to an R sigclust maintainer:

"Dear Steve,

I'm running R sigclust on a data matrix X with about 20,000 rows (samples) and 46 columns (features).

This matrix is somewhat sparse: About half of the entries are zero. This means the MAD(X) is zero and there is effectively no dimensionality reduction.

I'm not sure this last fact is what is contributing to the particular behavior, but

  1. The cluster index for the data X is consistently around .45
  2. The simulated cluster indices invariably have mean about .385 with a standard deviation of about .0025.

How should we interpret the fact that the data's cluster index is so much higher than any of the simulated cluster indices, which tend to group tightly around their mean?

I've run sigclust on matrices of similar dimensions both

  1. With entries simulated from N(0,1) and
  2. With entries half simulated from N(0,1) and half from N(.5, 1)

In both cases the cluster indices group tightly around the mean. In the first case there's not so much of a pattern as to where the data's cluster index falls with respect to the simulated indices, so the p value is not always zero or one, and sometimes it is not near either.

In the second case the cluster index of the input data is regularly less than all simulated cluster indices, giving a p value of zero. This is what we'd hope and expect.

I suppose I think I know how to interpret the p value being always zero but I don't know how to interpret it being always equal to one."
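To make the numbers in the email concrete, here is a rough sketch in plain Python. It assumes the empirical p-value is the fraction of simulated cluster indices that are no larger than the data's cluster index, which matches the behavior described above but is not copied from the R package. The sample values and the function names are made up for illustration. It also shows that a column in which more than half of the entries are zero has a MAD of zero.

```python
import statistics


def mad(values):
    """Median absolute deviation from the median of one column."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def empirical_p_value(data_ci, simulated_cis):
    """Fraction of simulated cluster indices that are <= the data's index.

    Assumption: this mirrors the p-value as described in the thread.
    """
    hits = sum(1 for ci in simulated_cis if ci <= data_ci)
    return hits / len(simulated_cis)


def z_score(data_ci, sim_mean, sim_sd):
    """How many simulated standard deviations the data's index is from the mean."""
    return (data_ci - sim_mean) / sim_sd


# Made-up column in which more than half of the entries are zero.
sparse_column = [0, 0, 0, 0, 3, 7]
print("MAD:", mad(sparse_column))

# Made-up simulated indices grouped tightly around .385, as in the email.
simulated = [0.382, 0.384, 0.385, 0.386, 0.388]
print("p when data CI is .45:", empirical_p_value(0.45, simulated))
print("p when data CI is .30:", empirical_p_value(0.30, simulated))
print("z-score of .45:", round(z_score(0.45, 0.385, 0.0025), 1))
```

With the figures quoted in the email, the data's cluster index sits roughly 26 simulated standard deviations above the simulated mean, which is why the p-value comes out as one on every run.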

  • New strategy: Throwing away the outliers
    • Very small number of items in one cluster (2)
    • After removing outliers, the cluster fitness was better

Plan: Try iteratively removing outliers. Look for obvious stopping point.
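One way the plan could look in code is sketched below. Everything here is an assumption for illustration: the function names, the toy one-dimensional 2-means split standing in for the real clustering step, and the stopping rule (stop as soon as the smaller cluster has at least min_size members). The actual stopping point is still to be found.

```python
def best_two_way_split(values):
    """Exact 1-D 2-means: try every cut of the sorted values.

    Toy stand-in (an assumption) for the 2-means step used by sigclust.
    Returns two lists of positions into `values`.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])

    def sse(idx):
        mean = sum(values[i] for i in idx) / len(idx)
        return sum((values[i] - mean) ** 2 for i in idx)

    cut = min(range(1, len(order)),
              key=lambda c: sse(order[:c]) + sse(order[c:]))
    return order[:cut], order[cut:]


def remove_outliers_iteratively(values, split, min_size=3, max_rounds=10):
    """Drop the smaller cluster while it has fewer than `min_size` members."""
    kept, removed = list(values), []
    for _ in range(max_rounds):
        if len(kept) < 2:
            break
        left, right = split(kept)
        small = left if len(left) <= len(right) else right
        if len(small) >= min_size:
            break  # hypothetical stopping point: no tiny cluster is left
        drop = set(small)
        removed.extend(kept[i] for i in sorted(drop))
        kept = [v for i, v in enumerate(kept) if i not in drop]
    return kept, removed


# Made-up data: ten values near 1 plus two obvious outliers.
values = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.05, 1.15, 0.95, 1.25, 50.0, 90.0]
kept, removed = remove_outliers_iteratively(values, best_two_way_split)
print(len(kept), removed)
```

On this toy data the first round isolates the two outliers as a cluster of size 2, mirroring the situation noted above, and the second round splits the remaining values about evenly, so the loop stops.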

Also @Ladsgroup will try clustering on this dataset to see if he gets good performance.

An important realization this week was that default pre-scaling of the input data (mean centering and normalizing variance to 1) did away with the strange behavior of the simulated CIs being so much lower than the input data CI. The scaling has taken us from always getting a p-value of 1 for the main dataset to always getting a p-value of 0. Thus, we begin clustering.
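For reference, the pre-scaling step amounts to something like the following plain-Python sketch. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption; the real code may differ on both points.

```python
import statistics


def standardize_columns(rows):
    """Mean-center each column and scale it to unit variance.

    Assumptions: population standard deviation; constant columns are
    mapped to zeros instead of dividing by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Made-up matrix: 4 samples, 2 features, first feature mostly zero.
rows = [[0.0, 10.0], [0.0, 20.0], [4.0, 30.0], [0.0, 40.0]]
scaled = standardize_columns(rows)
for col in zip(*scaled):
    print(round(statistics.fmean(col), 6), round(statistics.pstdev(col), 6))
```

After scaling, every non-constant feature contributes on the same scale, so a sparse feature with a few large values no longer dominates the cluster index.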

New additions are

  1. The new script enwiki_data/similarity.py takes a user-defined predicate defining a subset of row indices of input data, and uses Jaccard indices to compare various subsets. The current default for this predicate is label[i]==1, so the default behavior is to consider the subpopulation of samples which were labelled damaging.
  2. The new function recclust defined in sigclust/sigclust.py takes as input a data matrix X and a cutoff point and recursively applies sigclust to subclusters of X until all remaining subclusters have a p-value greater than the cutoff. recclust returns a dictionary with the following keys:

"prefix", "pval", "subclust0", "subclust1", "ids", and "tot"

"prefix" is a string representation for a path to the root cluster consisting of all samples in the input X.

"pval" is the p-value of cluster 'prefix'

"subclust{0,1}" are dictionaries of the same form as this one, representing the two primary subclusters of the cluster 'prefix'. Note that these may be None if pval is greater than the cutoff.

"ids" is a numpy array of keys to use to record the cluster elements themselves. Note that this may be None if pval is greater than the cutoff.

"tot" is the total number of subclusters of this cluster.

(Also note that the new files recclust_tr_0 and recclust_tr_1 in the subdirectory enwiki_data are the result of calling pickle.dump on the dictionaries returned by two calls to recclust on the enwiki data "enwiki_data/data2.tsv".)
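The Jaccard comparison that similarity.py performs on predicate-defined subsets can be illustrated roughly as follows. This is a sketch, not the script itself: the function names and the predicate signature are assumptions, and the labels and cluster membership are made up.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of row indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_indices(labels, predicate):
    """Row indices i for which predicate(labels, i) holds."""
    return {i for i in range(len(labels)) if predicate(labels, i)}


# Made-up labels: 1 means the sample was labelled damaging.
labels = [1, 0, 0, 1, 1, 0]
damaging = select_indices(labels, lambda label, i: label[i] == 1)

# Made-up cluster membership, e.g. the "ids" of one subcluster.
cluster_ids = [3, 4, 5]

print(sorted(damaging), jaccard(damaging, cluster_ids))
```

A value near 1 would mean the cluster and the damaging subpopulation are nearly the same set of rows; a value near 0 would mean they barely overlap.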

Halfak renamed this task from Compare python SigClust to R sigclust to Testing python sigclust (relationship between full cluster & damaging clusters). Nov 6 2015, 6:19 PM
Halfak moved this task from Completed to Backlog on the Machine-Learning-Team (Active Tasks) board.