
Generate spam and vandalism new page creation dataset
Closed, Resolved, Public


We should be able to build a good model for detecting spam and vandalism in new page creations, and we can likely get good training data by scanning the logging table for deletion reasons.

This task is done when we have a proposal for how to get data and some preliminary exploration of how we could model this.

Event Timeline

Copying over content from my personal documentation...

Overall plan -

  • Build an ML model based on ORES to go through new page creations (on EnWiki)
  • The model will detect spam/vandalism and deal with them on a “fast queue”
  • To build the model, we will get training datasets of articles = {Good, Bad}
    • Good = Articles which have not been deleted for at least 30 days
    • Bad = Articles that have been deleted within 30 days
    • Each Bad dataset will be sub-classified by the deletion reason
  • There will be approximately 20,000 pages in each dataset. Each page will be denoted by its first revision
    • All revisions within x minutes (mostly 2-5 mins) from the first revision might be also considered
    • We will choose the pages randomly from all pages created over a full year
      • The time span chosen is 1 May 2015 to 30 April 2016
      • We choose a full-year timeframe so that there is minimal effect from “editing periods” like school vacations, etc.
    • The number of pages in each dataset will be approximately proportional to the actual ratio of deleted/good articles.
    • Ideally, there will be at least 1,000 pages in each dataset
  • Initially, we will use the following deletion reasons for our Bad subsets -
    • G1,G3,G10,G11,A11 under CSD
    • Based on the dataset obtained, we might include/exclude deletion reasons.
  • The namespaces we will check will be {0,118} (Article space and draft space).
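The planned sub-classification by deletion reason can be sketched on the command line. This is a toy example with made-up deletion summaries, assuming admins' comments link the criterion as "[[WP:CSD#G11|...]]" (the same "G11|" style of pattern the queries below match); the real data would come from the logging-table query:

```shell
# Hypothetical sample of deletion-log comments, one per line.
cat > deletion_comments.txt <<'EOF'
[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion
[[WP:CSD#A11|A11]]: Obviously invented
[[WP:CSD#G10|G10]]: Attack page
[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion
EOF

# Bucket rows by CSD criterion. The "#G11|" pattern keeps G1 from also
# matching G10/G11.
for code in G1 G3 G10 G11 A11; do
  printf '%s\t%s\n' "$code" "$(grep -c "#${code}|" deletion_comments.txt)"
done
```

The same per-criterion counts would tell us early whether any of the planned Bad subsets falls below the ~1,000-page floor.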

Current Progress -

  • We will be querying the following tables on enwiki_p database - logging (for deletions), page, revision (for good dataset)
  • The following queries were tried using Quarry
  • From these, we have a basic format of the queries to be run for our datasets.
  • Current plan is to have 20k articles in the good set and 5k articles in each of the bad sets. We can later choose to truncate the overall dataset.
  • Since the Quarry queries are timing out, we’ll use a Tool Labs account to run them.

First two queries attempted, to test Tool Labs (tool name: "Sonitool")


cat query1.sql | sql enwiki_p > query1_op.tsv


USE enwiki_p;
SELECT page_id, page_title, first_rev_id
FROM (SELECT rev_page, MIN(rev_id) AS first_rev_id
      FROM revision
      GROUP BY rev_page
      LIMIT 1000) AS sample
INNER JOIN page ON page_id = rev_page
WHERE page_namespace IN (0, 118);


cat query2.sql | sql enwiki_p > query2_op.tsv


USE enwiki_p;
SELECT l.log_page, l.log_title, l.log_timestamp, l.log_comment
FROM logging l
WHERE l.log_type LIKE "%delete%"
  AND l.log_timestamp > 20150000000000
  AND l.log_namespace IN (0, 118)
  AND l.log_comment LIKE "%G11|%"
LIMIT 100;

Hi @Soni. Are you still interested in looking at this? I've suddenly got a bit more headspace to think about it.

Hey @Halfak. I actually am interested in going ahead with this. Just got my system fixed today, so I can finally start with this again tonight or tomorrow night.

Halfak renamed this task from [Explore] Spam and Vandalism new page creation to Generate spam and vandalism new page creation dataset.Sep 26 2016, 11:10 PM

I've got about 3k concerning deletions per month. There are about 80k total article creations per month.

$ cat enwiki.draft_quality.201605.tsv | wc
  79213  396065 4269080
$ grep -v OK enwiki.draft_quality.201605.tsv | wc
   2972   14860  157536

Actually, it looks like it fluctuates a little bit

$ grep -v OK enwiki.draft_quality.201508.tsv | wc
   2374   11870  124306
$ grep -v OK enwiki.draft_quality.201509.tsv | wc
   2464   12320  129031
$ grep -v OK enwiki.draft_quality.201510.tsv | wc
   2490   12450  130169
$ grep -v OK enwiki.draft_quality.201511.tsv | wc
   2508   12540  134642
$ grep -v OK enwiki.draft_quality.201512.tsv | wc
   2142   10710  113433
$ grep -v OK enwiki.draft_quality.201601.tsv | wc
   2585   12925  140562
$ grep -v OK enwiki.draft_quality.201602.tsv | wc
   2447   12234  132535
$ grep -v OK enwiki.draft_quality.201603.tsv | wc
   2686   13430  144353
$ grep -v OK enwiki.draft_quality.201604.tsv | wc
   2844   14220  155560
$ grep -v OK enwiki.draft_quality.201605.tsv | wc
   2972   14860  157536
$ wc datasets/enwiki.draft_quality.201508-201608.tsv
  909053  4544642 49006910 
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep -v OK | wc
  30078  150389 1606157

OK. 30k positives in 909k observations. That looks pretty good. We can probably trim the dataset down to 5k positives and 151k observations total. Let's look at the class breakdowns.

$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep attack | wc
   2515   12575  132315
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep hoax | wc
   2302   11510  115255
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep spam | wc
  18770   93850  983322
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep vandalism | wc
   6906   34529  399838

It looks like we get lots of examples of spam (G11: Unambiguous advertising or promotion) and few examples of hoax (A11: Obviously invented).

Woah... now that I look at the explanation for A11, that seems less concerning. It's not intended to be used to label hoaxes. Looks like we might want to drop that into the "OK" category.

Luckily, I could clean this all up on the command line with a little bit of sed. I also found that I was getting multiple rows for some of the deleted pages (probably because they were deleted more than once) so I used uniq to clean that up. Here's some new stats.
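That cleanup can be sketched roughly like this. The rows and column layout are made up for illustration (the real file and sed expression may differ): fold the A11 label into OK, then collapse duplicate rows from pages deleted more than once.

```shell
# Toy rows: page_id<TAB>label, with one page that was deleted twice.
printf '101\tspam\n102\tA11\n103\tvandalism\n103\tvandalism\n' > draft_quality_sample.tsv

# Relabel A11 as OK, then sort and deduplicate repeated rows.
sed 's/A11$/OK/' draft_quality_sample.tsv | sort | uniq > cleaned.tsv
```

After this, each page appears once and A11 no longer counts as a positive class.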

$ wc datasets/enwiki.draft_quality.201508-201608.tsv
  907427  4536512 48916102 
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep -v OK | wc
  26268  131339 1410292
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep attack | wc
   2451   12255  129112
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep spam | wc
  17704   88520  928705
$ cat datasets/enwiki.draft_quality.201508-201608.tsv | grep vandalism | wc
   6506   32529  375910

So, thinking about down-sampling: it looks like the "attack" class might get under-sampled if we go too low. It might make sense to just down-sample the "OK" class a bit. I think I'd like to balance it with the !OK classes, e.g. 26268/26268. That'll make the output probabilities a little bit funny, but it would also make our accuracy measures a bit more useful. So I should probably train two models. One would predict a boolean (needs_speedy_review), which would be True for all !OK. The other would actually try to predict what type of !OK it is.
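A minimal sketch of that balancing step, plus deriving the boolean needs_speedy_review label from the class label. Toy rows again (the column layout is hypothetical), and it assumes GNU `shuf` is available for the random sample:

```shell
# Toy labelled dataset: rev_id<TAB>label.
printf '1\tOK\n2\tOK\n3\tOK\n4\tOK\n5\tspam\n6\tattack\n' > labelled_sample.tsv

# Keep every !OK row, then randomly down-sample the OK class to the same size.
grep -v 'OK$' labelled_sample.tsv > notok.tsv
n=$(wc -l < notok.tsv)
grep 'OK$' labelled_sample.tsv | shuf -n "$n" > ok_sample.tsv
cat notok.tsv ok_sample.tsv > balanced.tsv

# Derive the boolean needs_speedy_review target: True for any !OK class.
awk -F'\t' '{print $1 "\t" ($2 == "OK" ? "False" : "True")}' balanced.tsv > boolean.tsv
```

The multiclass model would train on balanced.tsv directly, the boolean model on boolean.tsv; as noted above, the balanced sample skews the output probabilities relative to the true base rate.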