
Do pageslogging dumps in parallel pieces for at least wikidata, investigate rapid growth of logs
Closed, Resolved (Public)

Description

The XML dump of page and user log events takes almost 18 hours for wikidatawiki. That's too long; split it into several pieces, like the abstracts job. We can recombine them in a separate step; it's cheap enough for now.
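A minimal sketch of the splitting idea, assuming the log events can be partitioned by a contiguous numeric id range. This is not the code from the actual patch; `split_id_range`, `dump_piece`, `dump_in_parallel`, and the output file naming are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_id_range(max_id, pieces):
    """Split ids 1..max_id into contiguous, inclusive (start, end) ranges."""
    if max_id < 1 or pieces < 1:
        raise ValueError("max_id and pieces must both be at least 1")
    pieces = min(pieces, max_id)  # never create empty ranges
    base, extra = divmod(max_id, pieces)
    ranges = []
    start = 1
    for i in range(pieces):
        # The first `extra` pieces take one additional id each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def dump_piece(id_range):
    """Placeholder worker: a real one would dump the log events in id_range."""
    start, end = id_range
    # Made-up naming scheme, for illustration only.
    return "pages-logging-p{}-p{}.xml.gz".format(start, end)


def dump_in_parallel(max_id, pieces):
    """Run one worker per id range and return the piece names in id order."""
    ranges = split_id_range(max_id, pieces)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        return list(pool.map(dump_piece, ranges))
```

Because `pool.map` returns results in input order, a later recombine step could process the pieces in the order returned.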

While we're at it, see why the wikidata log file is almost 4 times larger than that for enwiki. Haven't we been here before? Are there a bunch of bot edits that are being autopatrolled or something?

Event Timeline

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

Change 394857 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] ability to do xmlpageslogging several pieces at a time in parallel

https://gerrit.wikimedia.org/r/394857

On archive.org (https://archive.org/download/wikidatawiki-20170401) I see this:

wikidatawiki-20170401-pages-logging.xml.gz     19-Apr-2017 21:30        8.6G

So we've grown 2.4G since April. For comparison, enwiki's pages-logging file is currently 3.0G TOTAL.
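The arithmetic behind those figures, as a quick sketch. The current wikidatawiki size is not quoted directly here; it is derived from the April size plus the stated growth.

```python
april_size_gb = 8.6    # wikidatawiki 20170401, from the archive.org listing
growth_gb = 2.4        # growth since April, as stated above
enwiki_size_gb = 3.0   # enwiki's current pages-logging file

# Derived values (not quoted in the task itself).
current_size_gb = april_size_gb + growth_gb
ratio = current_size_gb / enwiki_size_gb

print("current wikidatawiki size: {:.1f}G".format(current_size_gb))
print("ratio to enwiki: {:.2f}".format(ratio))
```

The ratio comes out a bit under 4, which matches the "almost 4 times larger" in the description.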

T49415 is the ticket where we had various edits that were autopatrolled and filling up the logging table. Time to see what's filling it these days.

mildly formatted for readability:

/data/xmldatadumps/public/wikidatawiki/20171120$ zcat wikidatawiki-20171120-pages-logging.xml.gz | grep '<logitem>' | wc -l
599 803 623
/data/xmldatadumps/public/enwiki/20171120$ zcat enwiki-20171120-pages-logging.xml.gz | grep '<logitem>' | wc -l
85 612 898
/data/xmldatadumps/public/commonswiki/20171120$ zcat commonswiki-20171120-pages-logging.xml.gz | grep '<logitem>' | wc -l
242 217 237
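For reference, a rough Python equivalent of the `zcat | grep '<logitem>' | wc -l` pipeline, streaming the gzipped file line by line so it never has to fit in memory. `count_logitems` is a hypothetical helper name, and it assumes, like the grep above, that counting lines containing `<logitem>` is the same as counting log items.

```python
import gzip


def count_logitems(path):
    """Count lines containing '<logitem>' in a gzipped XML dump."""
    count = 0
    # errors="replace" keeps the scan going if a line has bad bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            if "<logitem>" in line:
                count += 1
    return count


# Counts reported above for the 20171120 run.
reported = {
    "wikidatawiki": 599803623,
    "enwiki": 85612898,
    "commonswiki": 242217237,
}
for wiki, items in sorted(reported.items()):
    factor = items / reported["enwiki"]
    print("{}: {} items, {:.1f}x enwiki".format(wiki, items, factor))
```

By item count, wikidatawiki is roughly seven times enwiki, a larger gap than the file sizes alone suggest.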

When I check the patrol logs on wikidatawiki, there are nearly 500 autopatrol entries PER MINUTE. For comparison, enwiki peaks at around 15/min (patrolled, no autopatrol), dewiki also peaks at around 15/min (with autopatrol), and commons at around 80/min (with autopatrol). The vast majority of both commons and wikidata autopatrol entries are from bots.
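To put those rates in perspective, a back-of-the-envelope sketch. It assumes each rate were sustained around the clock, which overstates the wikis where the figure is a peak, so treat the results as upper bounds rather than measurements.

```python
MINUTES_PER_DAY = 60 * 24

# Approximate per-minute rates quoted above.
rates_per_minute = {
    "wikidatawiki": 500,  # "nearly 500" autopatrol entries
    "commonswiki": 80,    # peak
    "enwiki": 15,         # peak
    "dewiki": 15,         # peak
}

# Assumption: the rate holds for the whole day.
per_day = {wiki: rate * MINUTES_PER_DAY
           for wiki, rate in rates_per_minute.items()}

for wiki, entries in sorted(per_day.items(), key=lambda kv: -kv[1]):
    print("{}: up to {} patrol log entries per day".format(wiki, entries))
```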

ArielGlenn renamed this task from "Do pageslogging dumps in parallel pieces for at least wikidata wiki, perhaps others" to "Do pageslogging dumps in parallel pieces for at least wikidata, investigate rapid growth of logs". Dec 4 2017, 9:54 AM

This changeset is ready to go; it can be merged at the end of the current run, so around December 31.

Change 399589 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] enable pagelogs to be dumped by several processes in parallel

https://gerrit.wikimedia.org/r/399589

ArielGlenn added a subscriber: hoo. Dec 21 2017, 9:28 AM

I'm adding @hoo for this so we can get some eyeballs on the underlying issue: the logging table is getting huge.

Change 399589 merged by ArielGlenn:
[operations/puppet@production] enable pagelogs to be dumped by several processes in parallel

https://gerrit.wikimedia.org/r/399589

Change 394857 merged by ArielGlenn:
[operations/dumps@master] ability to do xmlpageslogging several pieces at a time in parallel

https://gerrit.wikimedia.org/r/394857

ArielGlenn closed this task as Resolved. Jan 16 2018, 12:19 PM

Page logs are being generated properly. As for the size of the logs, it looks like that's being dealt with in T184485, so I'm going to close this.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board. Jan 16 2018, 12:21 PM