This will support feature development for detecting new page creations that are spam, attacks, or vandalism, for those who lack the rights to look at the (deleted) pages directly.
|Resolved||Halfak||T148038 [Epic] Build draft quality model (spam, vandalism, attack, or OK)|
|Resolved||Halfak||T150307 Create manually vetted dataset of spam/vandalism/attack pages|
I'm working from the dataset generated in T135644 (Generate spam and vandalism new page creation dataset) to manually review the content of 75 deleted pages (25 spam, 25 vandalism, and 25 attack).
I just received a response from WMF Legal and Privacy. I'll be responding to them and employing some censorship. I'll be replacing censored content with a structured comment of the form:
<!-- Censored: <explanation> -->
E.g. I'd replace a postal address with <!-- Censored: PII (Postal address) --> or a birthday with <!-- Censored: PII (Birthday) -->. Here, "PII" means "Personally identifiable information".
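For anyone scripting against the vetted pages, here is a minimal sketch of how that comment format could be produced and later picked back out of the text. The function names (`censor_marker`, `censor`, `find_censored`) and the sample text are hypothetical and only for illustration; the actual review was done manually, and nothing here reflects a tool used in this task.

```python
import re

# Format taken from the comment above: <!-- Censored: <explanation> -->
CENSOR_TEMPLATE = "<!-- Censored: {explanation} -->"
CENSOR_RE = re.compile(r"<!-- Censored: (.*?) -->")


def censor_marker(explanation):
    """Build the structured comment for a given explanation."""
    # "--" inside an HTML comment would end it early, so reject it.
    if "--" in explanation:
        raise ValueError("explanation must not contain '--'")
    return CENSOR_TEMPLATE.format(explanation=explanation)


def censor(text, target, explanation):
    """Replace every occurrence of `target` in `text` with a censor marker."""
    return text.replace(target, censor_marker(explanation))


def find_censored(text):
    """Return the explanations of all censor markers found in `text`."""
    return CENSOR_RE.findall(text)


# Hypothetical example text, not taken from the dataset.
page = "Contact me at 12 Example Street."
redacted = censor(page, "12 Example Street", "PII (Postal address)")
print(redacted)
print(find_censored(redacted))
```

Keeping the explanation in a fixed, parseable shape means a later feature-extraction step could count or strip the markers without mistaking them for page content.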