⚓ T116403 Testing python sigclust (relationship between full cluster & damaging clusters)

aetilley created this task. Oct 23 2015, 5:46 PM

aetilley claimed this task.

aetilley raised the priority of this task from to Needs Triage.

aetilley updated the task description.

aetilley subscribed.

Restricted Application added a subscriber: Aklapper. Oct 23 2015, 5:46 PM

aetilley added a project: Machine-Learning-Team (Active Tasks). Oct 23 2015, 5:50 PM

aetilley set Security to None.

aetilley moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board.

@aetilley, I got an update from you in IRC yesterday. Can you add some notes here too? We can probably mark this done then.

See the recently added script R_read.R in sigclust/enwiki_data (https://github.com/aetilley/sigclust). Calling source("R_read.R") in R from inside the "enwiki_data" directory applies sigclust to the enwiki data (now named "data2.tsv") as well as to some artificial data.

Running it produces the kmeans warning "Quick-TRANSfer stage steps exceeded maximum", discussed here: http://stackoverflow.com/questions/21382681/kmeans-quick-transfer-stage-steps-exceeded-maximum

Lastly, from a recent email to an R sigclust maintainer:

"Dear Steve,

I'm running R sigclust on a data matrix X with about 20,000 rows (samples) and 46 columns (features).

This matrix is somewhat sparse: About half of the entries are zero. This means the MAD(X) is zero and there is effectively no dimensionality reduction.

I'm not sure this last fact is what is contributing to the particular behavior, but

The cluster index for the data X is consistently around .45

The simulated cluster indices invariably have mean about .385 with a standard deviation of about .0025.

How should we interpret the fact that the data's cluster index is so much higher than any of the simulated cluster indices, which tend to group tightly around their mean?

I've run sigclust on matrices of similar dimensions both

With entries simulated from N(0,1) and

With entries half simulated from N(0,1) and half from N(.5, 1)

In both cases the simulated cluster indices group tightly around their mean. In the first case there is no clear pattern as to where the data's cluster index falls relative to the simulated indices, so the p-value is not always zero or one, and is sometimes not near either.

In the second case the cluster index of the input data is regularly less than all simulated cluster indices, giving a p value of zero. This is what we'd hope and expect.

I suppose I think I know how to interpret the p value being always zero but I don't know how to interpret it being always equal to one."
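The two observations in the email can be reproduced on toy numbers. The sketch below is dependency-free and rests on two assumptions: MAD means the unscaled median absolute deviation from the median, and the p-value is the fraction of simulated cluster indices that fall below the data's cluster index (which matches the "always zero" and "always one" cases described above). The helper names and the simulated values are illustrative, not taken from either sigclust implementation.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def empirical_p_value(data_ci, simulated_cis):
    """Fraction of simulated cluster indices strictly below the data's CI."""
    below = sum(1 for ci in simulated_cis if ci < data_ci)
    return below / len(simulated_cis)


# A column in which more than half of the entries are zero has median 0,
# and more than half of its absolute deviations are 0 as well, so MAD is 0.
sparse_column = [0.0, 0.0, 0.0, 0.0, 3.2, 1.5, 0.7]
print(mad(sparse_column))

# Illustrative simulated CIs grouped tightly around 0.385 (made-up values).
simulated = [0.382, 0.384, 0.385, 0.386, 0.388]
print(empirical_p_value(0.45, simulated))  # data CI above every simulated CI
print(empirical_p_value(0.30, simulated))  # data CI below every simulated CI
```

With a data CI of 0.45 every simulated value lies below it, so this definition yields a p-value of one regardless of how many simulations are run.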

aetilley moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Oct 30 2015, 5:26 PM

aetilley moved this task from Completed to Backlog on the Machine-Learning-Team (Active Tasks) board.

New strategy: Throwing away the outliers
- One cluster contained very few items (2)
- After removing outliers, the cluster fitness was better

Plan: Try iteratively removing outliers. Look for obvious stopping point.
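One possible reading of this plan, as a rough sketch only: repeatedly split the data in two, and whenever the smaller side is tiny, treat it as outliers, drop it, and split again; stop once both sides are substantial. Everything here is an assumption for illustration: the data are 1-D, split_at_largest_gap is a toy stand-in for 2-means, and max_outlier_size and max_rounds are hypothetical parameters, not part of sigclust.

```python
def split_at_largest_gap(values):
    """Toy stand-in for 2-means on 1-D data: cut sorted values at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def remove_outliers(values, max_outlier_size=2, max_rounds=10):
    """Drop tiny clusters until a split yields two substantial clusters."""
    kept, removed = list(values), []
    for _ in range(max_rounds):
        if len(kept) < 2:
            break
        left, right = split_at_largest_gap(kept)
        small, large = sorted((left, right), key=len)
        if len(small) > max_outlier_size:
            break  # both clusters are substantial: the stopping point
        removed.extend(small)
        kept = large
    return kept, removed


values = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 100.0]  # illustrative data
kept, removed = remove_outliers(values)
print(kept)
print(removed)
```

On this toy input the first split isolates the single extreme value, and the second split gives two clusters of three items each, so the loop stops there.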

Also @Ladsgroup will try clustering on this dataset to see if he gets good performance.

An important realization this week was that default pre-scaling of the input data (mean centering and normalizing variance to 1) did away with the strange behavior of the simulated CIs (cluster indices) being so much lower than the input data's CI. The scaling has taken us from always getting a p-value of 1 for the main dataset to always getting a p-value of 0. Thus, we begin clustering.
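For reference, the pre-scaling described here amounts to column-wise standardization. The following is a minimal dependency-free sketch, not the code used in sigclust; it assumes the population standard deviation (some tools divide by n - 1 instead) and leaves constant columns at zero rather than dividing by zero.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Center each column to mean 0 and scale it to variance 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have standard deviation 0; use 1.0 so they map to zeros.
    scales = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / scale for value, mean, scale in zip(row, means, scales)]
        for row in rows
    ]


# Illustrative data: a dense column, a mostly-zero column, a constant column.
rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.0],
    [3.0, 10.0, 5.0],
    [4.0, 0.0, 5.0],
]
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```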

New additions are

The new script enwiki_data/similarity.py takes a user-defined predicate defining a subset of row indices of input data, and uses Jaccard indices to compare various subsets. The current default for this predicate is label[i]==1, so the default behavior is to consider the subpopulation of samples which were labelled damaging.
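As a reminder of what a Jaccard index measures here, the sketch below compares the set of row indices selected by a predicate with the set of rows assigned to one cluster. It is not the contents of similarity.py; the function names, labels, and cluster assignments are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of row indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def select_indices(n_rows, predicate):
    """Row indices i in range(n_rows) for which predicate(i) is true."""
    return {i for i in range(n_rows) if predicate(i)}


labels = [1, 0, 1, 1, 0, 0]       # illustrative: 1 means labelled damaging
cluster_ids = [0, 0, 1, 1, 1, 0]  # illustrative cluster assignment per row

damaging = select_indices(len(labels), lambda i: labels[i] == 1)
cluster_1 = select_indices(len(labels), lambda i: cluster_ids[i] == 1)
print(jaccard(damaging, cluster_1))  # 2 shared rows out of 4 in the union: 0.5
```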

The new function recclust defined in sigclust/sigclust.py takes as input a data matrix X and a cutoff point and recursively applies sigclust to subclusters of X until all remaining subclusters have p-value greater than the cutoff. recclust returns a dictionary with the following keys:

"prefix", "pval", "subclust0", "subclust1", "ids", and "tot"

"prefix" is a string representation for a path to the root cluster consisting of all samples in the input X.

"pval" is the p-value of cluster 'prefix'

"subclust{0,1}" are dictionaries of the same form as this one, representing the two primary subclusters of the cluster 'prefix'. Note that these may be None if pval is greater than the threshold.

"ids" is a numpy array of keys to use to record the cluster elements themselves. Note that this may be None if pval is greater than threshold.

"tot" is the total number of subclusters of this cluster.

(Also note that the new files recclust_tr_0 and recclust_tr_1 in the subdirectory enwiki_data are the result of calling pickle.dump on the dictionaries returned by two calls to recclust on the enwiki data "enwiki_data/data2.tsv".)
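A dictionary of this shape can be walked recursively. The sketch below relies only on the keys described above; the example tree, its prefix strings, and its p-values are hand-built for illustration and do not come from the pickled results.

```python
def walk_clusters(node):
    """Yield every cluster dictionary in the tree, parents before children."""
    if node is None:
        return
    yield node
    for key in ("subclust0", "subclust1"):
        yield from walk_clusters(node.get(key))


def significant_splits(tree, cutoff):
    """Prefixes of clusters whose p-value is at or below the cutoff."""
    return [n["prefix"] for n in walk_clusters(tree) if n["pval"] <= cutoff]


# Hand-built example; the prefix format and all values are hypothetical.
# A real tree could be loaded with pickle.load from enwiki_data/recclust_tr_0.
tree = {
    "prefix": "r", "pval": 0.0, "ids": [0, 1, 2, 3],
    "subclust0": {"prefix": "r0", "pval": 0.4, "ids": [0, 1],
                  "subclust0": None, "subclust1": None},
    "subclust1": {"prefix": "r1", "pval": 0.7, "ids": [2, 3],
                  "subclust0": None, "subclust1": None},
}

print([n["prefix"] for n in walk_clusters(tree)])
print(significant_splits(tree, cutoff=0.05))
```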

Restricted Application added a subscriber: StudiesWorld. Nov 6 2015, 5:03 PM

https://github.com/aetilley/sigclust

Halfak renamed this task from Compare python SigClust to R sigclust to Testing python sigclust (relationship between full cluster & damaging clusters). Nov 6 2015, 6:19 PM

Halfak moved this task from Completed to Backlog on the Machine-Learning-Team (Active Tasks) board.

Introducing soft thresholding in python sigclust:

https://github.com/aetilley/sigclust

Halfak moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Nov 13 2015, 6:13 PM

https://github.com/aetilley/sigclust/pull/1

ToAruShiroiNeko triaged this task as Medium priority. Nov 20 2015, 6:24 PM

Halfak closed this task as Resolved. Jan 21 2016, 3:40 PM
