
Duplicate clustering with old kmeans strategy
Closed, Resolved · Public


See @aetilley's repo

Event Timeline

Halfak created this task. Oct 30 2015, 5:44 PM
Halfak assigned this task to Ladsgroup.
Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Backlog on the Scoring-platform-team (Current) board.
Halfak added subscribers: Halfak, aetilley.
Restricted Application added a subscriber: Aklapper. Oct 30 2015, 5:44 PM

From data2.tsv I get this (87 1s, 715 0s). The source code to reproduce this is in a gist.

The file data2.tsv has 19863 samples, but your clusters sum to 802 samples. Let me look at the code you sent and get back to you.

That's because we only test on reverted edits, and the last column is the reverted status (not a feature). I made this mistake initially too :)

I had understood that we were interested in clustering edits generally, so I just dropped the last column. Aaron, which did you have in mind?

Halfak added a comment. Nov 6 2015, 6:10 PM

Responded in IRC. Do both! Cluster the entire set and also cluster just the damaging set and compare the difference.
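
A minimal sketch of that comparison, assuming each row parsed from data2.tsv holds numeric features followed by the reverted status (0 or 1) in the last column. The function names and the hand-rolled two-cluster k-means below are illustrative assumptions, not the code from the gist or from @aetilley's repo.

```python
import random


def split_features_and_label(rows):
    """Separate the last column (assumed to be the reverted flag) from the features."""
    features = [tuple(float(v) for v in row[:-1]) for row in rows]
    labels = [int(float(row[-1])) for row in rows]
    return features, labels


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires len(points) >= k
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def cluster_sizes(assignment, k=2):
    return [assignment.count(c) for c in range(k)]


def compare_clusterings(rows, k=2):
    """Cluster the entire set, then only the reverted rows, and return both size lists."""
    features, labels = split_features_and_label(rows)
    all_sizes = cluster_sizes(kmeans(features, k), k)
    damaging = [f for f, label in zip(features, labels) if label == 1]
    damaging_sizes = cluster_sizes(kmeans(damaging, k), k)
    return all_sizes, damaging_sizes
```

In this sketch the sizes from the full run sum to the number of rows in the file, while the sizes from the damaging-only run sum to the number of reverted rows, which is the kind of difference that came up earlier in this thread. The reverted column is never passed to the clustering as a feature in either run.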

Restricted Application added a subscriber: StudiesWorld. Nov 6 2015, 6:10 PM
Halfak closed this task as Resolved. Nov 19 2015, 11:46 PM
Halfak set Security to None.