
Create manually vetted dataset of spam/vandalism/attack pages
Closed, Resolved (Public)


This will support the development of features for detecting new page creations that are spam, attacks, or vandalism. The dataset is for those who lack the rights to look at the (deleted) pages directly.

Event Timeline

I'm working from the dataset generated in T135644 (Generate spam and vandalism new page creation dataset) to manually review the content of 75 deleted pages (25 spam, 25 vandalism, and 25 attack).

I just submitted the reviewed dataset to WMF Legal and Privacy. Spoiler alert, there's a lot of scary stuff in "attack" pages. We'll see how this goes.

I just received a response from WMF Legal and Privacy. I'll be responding to them and employing some censorship. I'll be replacing censored content with a structured comment of the form:

<!-- Censored: <explanation> -->

E.g. I'd replace a postal address with <!-- Censored: PII (Postal address) --> or a birthday with <!-- Censored: PII (Birthday) -->. Here, "PII" means "Personally identifiable information".
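For anyone consuming the dataset, the comment format described above could be produced and detected along these lines. This is only a rough sketch: the function names (`censor_marker`, `censor`, `list_censored`) and the sample text are hypothetical, and nothing here claims to be how the censoring was actually carried out, which may well have been done by hand.

```python
import re
from typing import List

# Matches the structured comment form: <!-- Censored: <explanation> -->
CENSOR_PATTERN = re.compile(r"<!-- Censored: (.*?) -->")


def censor_marker(explanation: str) -> str:
    """Build the structured comment for a given explanation (hypothetical helper)."""
    return f"<!-- Censored: {explanation} -->"


def censor(text: str, sensitive: str, explanation: str) -> str:
    """Replace every literal occurrence of `sensitive` with a censor marker."""
    return text.replace(sensitive, censor_marker(explanation))


def list_censored(text: str) -> List[str]:
    """Return the explanations of all censor markers found in `text`."""
    return CENSOR_PATTERN.findall(text)


# Made-up sample text, not taken from the dataset.
page = "He lives at 1 Example St. and was born on 1 Jan 1990."
page = censor(page, "1 Example St.", "PII (Postal address)")
page = censor(page, "1 Jan 1990", "PII (Birthday)")
```

Because the markers are ordinary HTML comments, they stay invisible if the wikitext is rendered, while `list_censored(page)` lets a feature-extraction script count or skip the censored spans.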

Halfaker, Aaron (2016): Deleted Wikipedia articles (spam/vandalism/attack). figshare.
Retrieved: 19:37, Nov 21, 2016 (GMT)