Jun 30 2016
Jan 17 2016
Jan 15 2016
Redirecting into Project
Jan 8 2016
Jan 1 2016
Dec 18 2015
PCFG object beta complete
Dec 11 2015
Looked at two more papers.
Dec 3 2015
Sorry, I just saw this. All done.
Nov 28 2015
Using large feature sets requires very large datasets to be effective, and the more subtle the content that you're trying to extract (e.g. "sneaky vandalism") the more difficult it is to extract this content from an editor's word choice.
Nov 20 2015
Nov 13 2015
Python and R sigclusts giving similar results on enwiki data. See R_read.R in /tests.
Nov 11 2015
Introducing soft thresholding in python sigclust:
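The implementation itself is not recorded in these notes. As a rough sketch only, assuming the formulation commonly given in the sigclust literature: each sample covariance eigenvalue is shifted down by a common amount tau and floored at the background noise variance, with tau chosen so that the total variance is preserved where possible. The function name, the `noise_var` argument, and the use of bisection to find tau are all assumptions made for illustration, not the project's actual code.

```python
import numpy as np

def soft_threshold_eigenvalues(eigvals, noise_var, iters=200):
    """Hypothetical sketch: shrink eigenvalues by a common shift tau,
    floor them at noise_var, and pick tau so the sum is preserved."""
    lam = np.asarray(eigvals, dtype=float)
    total = lam.sum()

    def shrunk(tau):
        # Shift every eigenvalue down by tau, but never below the noise floor.
        return np.maximum(lam - tau, noise_var)

    # tau = 0 is plain flooring; at tau = hi every eigenvalue sits on the floor.
    lo, hi = 0.0, max(lam.max() - noise_var, 0.0)
    if shrunk(hi).sum() > total:
        # Even the flattest spectrum exceeds the original total variance,
        # so the total cannot be preserved; return the all-floor spectrum.
        return shrunk(hi)

    # The sum of shrunk(tau) is non-increasing in tau, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if shrunk(mid).sum() > total:
            lo = mid
        else:
            hi = mid
    return shrunk(hi)
```

For example, with eigenvalues `[10, 4, 0.5, 0.1]` and `noise_var=1`, the two small eigenvalues are raised to 1 and the two large ones are lowered by the same amount so that the total stays at 14.6.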
Nov 6 2015
Python Sigclust and R sigclust gave similar results on enwiki_data.
An important realization this week was that the default pre-scaling of the input data (mean centering and normalizing the variance to 1) did away with the strange behavior of the simulated cluster indices (CIs) being so much lower than the CI of the input data. The scaling has taken us from always getting a p-value of 1 for the main dataset to always getting a p-value of 0. Thus, we begin clustering.
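The pre-scaling described here amounts to standardizing each feature column. A minimal sketch with NumPy follows; the function name `prescale`, the choice of the sample standard deviation (`ddof=1`), and the handling of constant columns are assumptions for illustration and may differ from what the Python sigclust actually does.

```python
import numpy as np

def prescale(X):
    """Hypothetical sketch: mean-center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample standard deviation per column (assumed convention: ddof=1).
    std = centered.std(axis=0, ddof=1)
    # Leave constant columns at zero instead of dividing by zero.
    std[std == 0] = 1.0
    return centered / std
```

After this step every non-constant column has mean 0 and variance 1, so no single feature dominates the cluster index simply because of its units.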
Nov 3 2015
I had understood that we were interested in clustering edits generally. Thus I just dropped the last column. Aaron, which did you have in mind?
Nov 2 2015
The file data2.tsv has 19863 samples, your clusters sum to 802 samples. Let me look at the code you sent and get back to you.
Oct 30 2015
Oct 28 2015
- See the recently added script R_read.R in sigclust/enwiki_data (see https://github.com/aetilley/sigclust). Call source("R_read.R") in R (from inside the "enwiki_data" directory) to apply sigclust to the enwiki data (now titled "data2.tsv") as well as to some other artificial data.
Oct 23 2015
Oct 16 2015
"Hard Thresholding" variant implemented.
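For reference, a minimal sketch of what hard thresholding usually means in the sigclust literature: any sample covariance eigenvalue that falls below the estimated background noise variance is raised to that level, and the rest are left unchanged. The function name and the `noise_var` argument are hypothetical; how the noise variance is estimated is outside this sketch, and the project's own variant may differ.

```python
import numpy as np

def hard_threshold_eigenvalues(eigvals, noise_var):
    """Hypothetical sketch: floor each eigenvalue at the noise variance."""
    lam = np.asarray(eigvals, dtype=float)
    # Eigenvalues at or above the floor are kept; smaller ones are raised to it.
    return np.maximum(lam, noise_var)
```

Note that this can only increase the total variance of the null Gaussian, since small eigenvalues are raised and nothing is lowered to compensate.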
Oct 9 2015
Converting the algorithm summary into pseudocode.
Sep 25 2015
Sep 18 2015
Sep 11 2015
The Sandvig paper did make brief mention of feedback mechanisms which seem to be pertinent to our considerations.
Tufekci's paper is mostly expository of other studies, but the studies that she mentions are truly fascinating. aetilley has never had a Facebook account, but was intrigued by the possibilities that Tufekci mentions.
Sandvig et al. seem to be a diverse group of experts taking many pages to say something which is more or less obvious, but perhaps it bears repeating. There is a distinction between a function, an algorithm for computing a function, and a specific implementation of an algorithm. Racism, and bias in general, can creep in at more than one level.
A mantra that kept coming to mind while reading these was "strive for open algorithms and open training sets." The principal barrier here is determining the level of detail at which to describe an algorithm or dataset to a user who is most likely non-technical, or at which to let that user specify their own personal algorithm.