Generate daily diffs for recently changed categories
Closed, Resolved, Public

Description

Using the script from T173774: Create script to dump recently changed categories, generate daily dumps of the categories that were changed. This will allow loading only the daily updates instead of reloading the whole category set (which for commonswiki can take significant time and currently stalls updates for up to an hour).

Event Timeline

Smalyshev triaged this task as Medium priority. Jun 27 2018, 8:16 PM
Smalyshev created this task.
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.

Change 378355 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Generate daily diffs for categories RDF

https://gerrit.wikimedia.org/r/378355

Smalyshev moved this task from Next to In review on the User-Smalyshev board. Jun 27 2018, 11:53 PM
Smalyshev moved this task from In review to Doing on the User-Smalyshev board.
Smalyshev moved this task from Doing to In review on the User-Smalyshev board. Jun 29 2018, 8:29 PM

On testwiki and test2wiki I get the following when running for the last seven-day interval (all other wikis run ok):

Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: SELECT  rc_timestamp,page_title,page_namespace,rc_title,rc_cur_id,pp_propname,cat_pages,cat_subcats,cat_files  FROM `recentchanges` FORCE INDEX (new_name_timestamp) LEFT JOIN `page_props` ON (pp_propname = 'hiddencat' AND (pp_page = rc_cur_id)) LEFT JOIN `category` ON ((cat_title = rc_title))   WHERE (rc_timestamp >= '20180625095244') AND (rc_timestamp < '20180702095244') AND rc_namespace = '14' AND rc_new = '0' AND rc_log_type = 'move' AND rc_type = '3'  ORDER BY rc_timestamp ASC LIMIT 200  
Function: BatchRowIterator::next
Error: 1054 Unknown column 'page_title' in 'field list'

Can we take care of that before this goes live? Even just removing those two wikis from the categories-rdf dblist would be ok.

See T198629. It turns out this happens across many of the wikis, not just those two; probably the config settings are such that only for test and test2 is the error output displayed on the console. Could you have a look? Thanks!

daniel added a comment. Edited Jul 2 2018, 3:42 PM

@ArielGlenn found the bug:

In https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/categoryChangesAsRdf.php$223 we have:

$tables += $extra_tables;

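But += does not work as expected on indexed arrays in PHP: it is an array union by key, so any element of $extra_tables whose integer index already exists in $tables is silently dropped. Using array_merge instead fixes the issue, because it appends the values and renumbers integer keys.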

$tables = array_merge( $tables, $extra_tables );

This never worked. A local test run immediately failed with an exception.
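
To illustrate the difference described above, here is a minimal standalone sketch. The table names are hypothetical stand-ins chosen to resemble the failing query; the actual contents of $tables and $extra_tables in categoryChangesAsRdf.php may differ.

```php
<?php
// Hypothetical table lists, for illustration only; not copied from
// categoryChangesAsRdf.php.
$tables = [ 'recentchanges', 'page_props', 'category' ];
$extra_tables = [ 'page' ];

// Array union (+) keeps the left-hand value for every key present on
// both sides. Index 0 already exists in $tables, so 'page' is dropped.
$union = $tables + $extra_tables;

// array_merge appends the values and renumbers integer keys, so 'page'
// ends up at index 3.
$merged = array_merge( $tables, $extra_tables );

var_export( $union );
echo "\n";
var_export( $merged );
echo "\n";
```

If the dropped entry is a table that the SELECT list depends on, the query fails with an "Unknown column" error of the kind shown above.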

Change 443449 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Use array_merge to merge indexed arrays in categoryChangesAsRdf.php.

https://gerrit.wikimedia.org/r/443449

Change 443449 merged by jenkins-bot:
[mediawiki/core@master] Use array_merge to merge indexed arrays in categoryChangesAsRdf.php.

https://gerrit.wikimedia.org/r/443449

Change 445720 had a related patch set uploaded (by Zfilipin; owner: Smalyshev):
[operations/mediawiki-config@master] Remove labs wikis from the categories-rdf list, don't need them

https://gerrit.wikimedia.org/r/445720

Change 445720 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove labs wikis from the categories-rdf list, don't need them

https://gerrit.wikimedia.org/r/445720

Mentioned in SAL (#wikimedia-operations) [2018-07-19T11:46:15Z] <zfilipin@deploy1001> Synchronized dblists/categories-rdf.dblist: SWAT: [[gerrit:445720|Remove labs wikis from the categories-rdf list, dont need them (T198356)]] (duration: 00m 55s)

The dblist fix has been deployed, off to test the actual bash script now. Until now it's all been manual runs across the dblist with direct calls to the maintenance script.

Change 378355 merged by ArielGlenn:
[operations/puppet@production] Generate daily diffs for categories RDF

https://gerrit.wikimedia.org/r/378355

This is now deployed; I'll check tomorrow that the dailies ran ok, and we'll know about the fulls over the weekend.

Change 449994 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] category diffs: full ts dir is different than dailies ts dir

https://gerrit.wikimedia.org/r/449994

Change 449994 merged by ArielGlenn:
[operations/puppet@production] category diffs: full ts dir is different than dailies ts dir

https://gerrit.wikimedia.org/r/449994

I've run the script manually with the above change applied; results are available in the expected location.

@ArielGlenn thanks for your help! I'll now watch how it works over the next week and then try to switch dump loading to dailies.

Smalyshev moved this task from In review to Doing on the User-Smalyshev board. Aug 2 2018, 9:55 PM
Smalyshev closed this task as Resolved. Aug 4 2018, 2:59 AM