This dataset will support the development of features for detecting new page creations that are spam, attacks, or vandalism, for those who lack the rights to look at the (deleted) pages directly.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Halfak | T148038 [Epic] Build draft quality model (spam, vandalism, attack, or OK) |
| Resolved | | Halfak | T150307 Create manually vetted dataset of spam/vandalism/attack pages |
Event Timeline
I'm working from the dataset generated in T135644 (Generate spam and vandalism new page creation dataset) to manually review the content of 75 deleted pages (25 spam, 25 vandalism, and 25 attack).
I just submitted the reviewed dataset to WMF Legal and Privacy. Spoiler alert, there's a lot of scary stuff in "attack" pages. We'll see how this goes.
I just received a response from WMF Legal and Privacy. I'll be responding to them and employing some censorship. I'll be replacing censored content with a structured comment of the form:
<!-- Censored: <explanation> -->
E.g. I'd replace a postal address with <!-- Censored: PII (Postal address) --> or a birthday with <!-- Censored: PII (Birthday) -->. Here, "PII" means "Personally identifiable information".
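As a rough illustration of the comment format above, here is a minimal Python sketch of how manually identified spans could be swapped for the structured comment. The function names (`censor_marker`, `censor_spans`, `list_censored`) and the sample sentence are hypothetical and made up for this example; the actual review was done by hand, and this is not the tooling used for the dataset.

```python
import re

# Structured comment format described above.
CENSOR_TEMPLATE = "<!-- Censored: {explanation} -->"
CENSOR_RE = re.compile(r"<!-- Censored: (.*?) -->")


def censor_marker(explanation):
    """Build the structured comment for one censored item (hypothetical helper)."""
    return CENSOR_TEMPLATE.format(explanation=explanation)


def censor_spans(text, spans):
    """Replace each (start, end, explanation) span in text with a marker.

    Spans are assumed to have been chosen by a human reviewer and must not
    overlap; an overlap raises ValueError.
    """
    pieces = []
    cursor = 0
    for start, end, explanation in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans")
        pieces.append(text[cursor:start])      # keep text before the span
        pieces.append(censor_marker(explanation))
        cursor = end                           # skip the censored content
    pieces.append(text[cursor:])               # keep the remaining tail
    return "".join(pieces)


def list_censored(text):
    """Return the explanations of every marker found in text."""
    return CENSOR_RE.findall(text)


# Made-up sample text, not taken from the dataset.
sample = "Born on 1 January 1990 in Springfield."
start = sample.index("1 January 1990")
censored = censor_spans(
    sample, [(start, start + len("1 January 1990"), "PII (Birthday)")]
)
print(censored)
print(list_censored(censored))
```

Because the marker is a regular pattern, anyone consuming the dataset could, under the same assumptions, use something like `list_censored` to count or filter censored items by their explanation.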
Halfaker, Aaron (2016): Deleted Wikipedia articles (spam/vandalism/attack). figshare.
https://dx.doi.org/10.6084/m9.figshare.4245035.v1
Retrieved: 19:37, Nov 21, 2016 (GMT)