Duplicate clustering with old kmeans strategy
Closed, Resolved
Public
Assigned To

Authored By

	Halfak
	Oct 30 2015, 5:44 PM

Description

See @aetilley's repo

Halfak assigned this task to Ladsgroup.

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added subscribers: Halfak, aetilley.

From data2.tsv I get this (87 1s, 715 0s). Source code to reproduce is on gist.

The file data2.tsv has 19863 samples, but your clusters sum to only 802 samples. Let me look at the code you sent and get back to you.

That's because we only test on reverted edits, and the last column is the reverted status (not a feature). I made this mistake initially too :)
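For reference, here is a minimal sketch of the column handling described above. It is not the code from the gist. It assumes every column except the last is numeric and that the last column holds the reverted status as "True"/"False" (or "1"/"0"); the sample rows and the names load_rows, all_features, and reverted_only are hypothetical.

```python
import csv
import io


def load_rows(tsv_text):
    """Parse TSV text into (features, reverted) pairs.

    Assumption: the last column is the reverted status, not a feature.
    """
    pairs = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        *feature_cells, label_cell = row
        features = [float(cell) for cell in feature_cells]
        reverted = label_cell.strip().lower() in ("true", "1")
        pairs.append((features, reverted))
    return pairs


def all_features(pairs):
    """Feature vectors for every edit, with the label column dropped."""
    return [features for features, _ in pairs]


def reverted_only(pairs):
    """Feature vectors for reverted edits only."""
    return [features for features, reverted in pairs if reverted]


# Hypothetical stand-in for data2.tsv: two features plus the label column.
SAMPLE_TSV = "0.1\t3\tFalse\n0.9\t12\tTrue\n0.2\t4\tFalse\n0.8\t10\tTrue\n"
pairs = load_rows(SAMPLE_TSV)
print(len(all_features(pairs)), len(reverted_only(pairs)))
```

Either way, the label never reaches the clusterer as a feature; the only difference is whether the rows are filtered on it first.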

I had understood that we were interested in clustering edits generally, so I just dropped the last column. Aaron, which did you have in mind?

Responded in IRC. Do both! Cluster the entire set and also cluster just the damaging set and compare the difference.
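A rough sketch of the "do both" comparison, under stated assumptions: it uses a plain Lloyd's k-means written from scratch rather than the strategy in @aetilley's repo, k=2 is chosen only for illustration, and the data points are made up. The names kmeans, cluster_sizes, all_points, and damaging_points are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct sample points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def cluster_sizes(assignments, k):
    return [sum(1 for a in assignments if a == c) for c in range(k)]


# Made-up (features, reverted) rows standing in for the real data.
rows = [
    ([0.0, 0.0], False), ([0.1, 0.2], False), ([0.2, 0.1], False),
    ([5.0, 5.0], True), ([5.1, 4.9], True),
    ([9.0, 9.0], True), ([9.1, 8.8], True),
]
all_points = [features for features, _ in rows]
damaging_points = [features for features, reverted in rows if reverted]

K = 2
_, assign_all = kmeans(all_points, K)
_, assign_damaging = kmeans(damaging_points, K)
sizes_all = cluster_sizes(assign_all, K)
sizes_damaging = cluster_sizes(assign_damaging, K)

print("entire set:", sizes_all, "of", len(all_points))
print("damaging set:", sizes_damaging, "of", len(damaging_points))
```

In each run the cluster sizes should sum to the number of samples that were clustered, which is a quick way to tell which of the two sets a given result came from.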

Halfak closed this task as Resolved. Nov 19 2015, 11:46 PM

Halfak set Security to None.