Translation bot broke core tests
Closed, ResolvedPublic

Event Timeline

Change 282067 had a related patch set uploaded (by MaxSem):
Unbreak tests

https://gerrit.wikimedia.org/r/282067

Actually, I'd like to investigate this further, to avoid recurrences.

There is already a report on the general case, you may want to continue there.

@Nemo_bis There is T48833: Do not override additions in mediawiki extension export, is that what you mean with the general case?

In this case there was a long time between the addition and the removal, so it shouldn't have happened. It should only be a risk for commits merged between the twn import and the twn export, which I think is usually less than an hour.

@Raymond do you recall anything special about this case?

@Nikerabbit No idea what happened or why. The new message was added on April 5th with f3b2f2023a5978bc2585b7f9bcd9ccd41409be3a but neither en (https://translatewiki.net/wiki/MediaWiki:Api-error-was-deleted/en ) nor qqq ( https://translatewiki.net/wiki/MediaWiki:Api-error-was-deleted/qqq ) was imported into twn. Therefore the removal of qqq in the next export run was the logical, but wrong, follow-up by our scripts.
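To illustrate the failure mode described above: if the export simply writes out what twn knows, any key that exists in the repository but was never imported into twn gets dropped. A minimal sketch of a pre-export sanity check follows. The function name and the sample key sets are hypothetical and this is not how the actual export scripts work.

```python
def keys_export_would_drop(repo_keys, twn_keys):
    """Return the keys present in the repository file but unknown to twn.

    An export that overwrites the repository file with twn's view
    would silently remove exactly these keys.
    """
    return sorted(set(repo_keys) - set(twn_keys))


# Hypothetical state just before the export run: the new message is in
# the repository's qqq file, but the import never created it on twn.
repo_qqq = {"upload", "api-error-was-deleted"}
twn_qqq = {"upload"}

print(keys_export_would_drop(repo_qqq, twn_qqq))
# ['api-error-was-deleted']
```

A non-empty result would be a reason to stop and look before committing the export.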

Original change merged
Change has been successfully merged into the git repository. 04-06 05:06 UTC
My autosync, which should not touch mediawiki at all, runs:
[06/Apr/2016 04:50:56 +0000] /betawiki (ProcessMessageChanges) /srv/mediawiki/targets/production/extensions/Translate/scripts/processMessageChanges.php --safe-import --group=* --skipgroup=ext-*,core,mediawiki*
Raimond's autosync runs:
[06/Apr/2016 06:06:41 +0000] /betawiki (ProcessMessageChanges) /srv/mediawiki/targets/production/extensions/Translate/scripts/processMessageChanges.php --name mediawiki --group=core,ext-*,mediawiki* --quiet
Raimond does export:
[06/Apr/2016 20:32:52 +0000] raymond/raymond (CommandlineExport) /srv/mediawiki/targets/production/extensions/Translate/scripts/export.php --target . --group core --lang=* --skip test,aeb,be-x-old,crh,dk,en,fiu-vro,gan,gom,got,hif,kbd,kk,kk-cn,iu,kk-kz,kk-tr,ko-kp,ku,ku-arab,no,ruq,simple,sr,tg,tp,tt,ug,zh,zh-classical,zh-cn,zh-sg,zh-hk,zh-min-nan,zh-mo,zh-my,zh-tw,zh-yue,bbc,ady --threshold 18 --hours 200
Patch uploaded and merged:

Uploaded 2016-04-06 20:35
Updated	 2016-04-06 20:56
twn:/resources/caches/translatewiki.net$ TZ=utc ls *mediawiki*cdb* -l --full-time | tail -n 4
-rw-rw-r--+ 1 betawiki users  3567 2016-04-05 20:07:44.629254000 +0000 messagechanges.mediawiki.cdb-1459886901
-rw-rw-r--+ 1 betawiki users  4598 2016-04-06 06:12:48.537999000 +0000 messagechanges.mediawiki.cdb-1459924058
-rw-rw-r--+ 1 betawiki users  5273 2016-04-06 11:50:47.969999000 +0000 messagechanges.mediawiki.cdb-1459943470
-rw-rw-r--+ 1 betawiki users  2848 2016-04-07 05:56:37.033999000 +0000 messagechanges.mediawiki.cdb-1460008622
twn:/resources/caches/translatewiki.net$ TZ=utc ls *mediawiki*cdb* -l --full-time --time=ctime | tail -n 4
-rw-rw-r--+ 1 betawiki users  3567 2016-04-05 20:08:21.321254000 +0000 messagechanges.mediawiki.cdb-1459886901
-rw-rw-r--+ 1 betawiki users  4598 2016-04-06 06:27:38.121999000 +0000 messagechanges.mediawiki.cdb-1459924058
-rw-rw-r--+ 1 betawiki users  5273 2016-04-06 11:51:10.289999000 +0000 messagechanges.mediawiki.cdb-1459943470
-rw-rw-r--+ 1 betawiki users  2848 2016-04-07 05:57:02.617999000 +0000 messagechanges.mediawiki.cdb-1460008622

From this listing, messagechanges.mediawiki.cdb-1459924058 looks to be the file where the changes were supposed to be. The Unix timestamp in its name is 2016-04-06 06:27:38 UTC, which is when the file was processed and matches the ctime. The date in the first listing (mtime) should be when the script wrote the file out.
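For anyone who wants to repeat the timestamp comparison, here is a small sketch that converts the numeric suffix of a cdb file name to UTC. The helper name is made up for this example; it only assumes the suffix after the last hyphen is a Unix timestamp, as in the listing above.

```python
from datetime import datetime, timezone


def cdb_suffix_to_utc(filename):
    """Parse the trailing Unix timestamp of a messagechanges cdb file name."""
    epoch = int(filename.rsplit("-", 1)[1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


print(cdb_suffix_to_utc("messagechanges.mediawiki.cdb-1459924058").isoformat())
# 2016-04-06T06:27:38+00:00
```

The result matches the ctime shown for that file (06:27:38), not its mtime (06:12:48).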

grep confirms that this file contains the message:

grep api-error-was-deleted *mediawiki*cdb*
Binary file messagechanges.mediawiki.cdb-1459924058 matches

I used the following script to dump the file:

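// Dump every key/value pair. Start the loop from firstkey() so the first entry is not skipped.
$x = \Cdb\Reader::open( 'messagechanges.mediawiki.cdb-1459924058' );
for ( $k = $x->firstkey(); $k !== false; $k = $x->nextkey() ) {
	var_dump( $k, unserialize( $x->get( $k ) ) );
}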

The output is

string(4) "core"
array(3) {
  ["en"]=>
  array(1) {
    ["addition"]=>
    array(1) {
      [0]=>
      array(2) {
        ["key"]=>
        string(21) "api-error-was-deleted"
        ["content"]=>
        string(74) "A file of this name has been previously uploaded and subsequently deleted."
      }
    }
  }
  ["tl"]=>
  array(1) {
    ["change"]=>
    array(1) {
      [0]=>
      array(2) {
        ["key"]=>
        string(6) "upload"
        ["content"]=>
        string(22) "Mag-upload ng talaksan"
      }
    }
  }
  ["qqq"]=>
  array(1) {
    ["addition"]=>
    array(1) {
      [0]=>
      array(2) {
        ["key"]=>
        string(21) "api-error-was-deleted"
        ["content"]=>
        string(78) "API error message that can be used for client side localisation of API errors."
      }
    }
  }
}
string(17) "ext-googlegeocode" [SNIP]
string(16) "ext-kartographer" [SNIP]

So we can see that the change was picked up by the system. What happened after that is still a mystery.
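The dump is nested as group, then language, then change type, then a list of key/content entries. As a rough sketch, the same lookup that grep did on the binary file can be done on the decoded structure. The Python dict below is a hand-copied excerpt of the dump above and the function name is hypothetical.

```python
def find_key(changes, wanted):
    """Yield (group, language, change_type) for every entry whose key matches."""
    for group, languages in changes.items():
        for lang, by_type in languages.items():
            for change_type, entries in by_type.items():
                for entry in entries:
                    if entry["key"] == wanted:
                        yield group, lang, change_type


# Excerpt of the "core" group from the dump, rewritten as a Python dict.
changes = {
    "core": {
        "en": {"addition": [{
            "key": "api-error-was-deleted",
            "content": "A file of this name has been previously uploaded and subsequently deleted.",
        }]},
        "tl": {"change": [{"key": "upload", "content": "Mag-upload ng talaksan"}]},
        "qqq": {"addition": [{
            "key": "api-error-was-deleted",
            "content": "API error message that can be used for client side localisation of API errors.",
        }]},
    }
}

print(list(find_key(changes, "api-error-was-deleted")))
# [('core', 'en', 'addition'), ('core', 'qqq', 'addition')]
```

Both en and qqq show up as pending additions, which is what the import should have applied.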

Okay, JobQueue has been broken for a while:

timetamp=20160406045841 (id=4554446,timestamp=20160406045841) t=0 good
<!DOCTYPE html>
<html><head><title>Database error - translatewiki.net</title><style>body { font-family: sans-serif; margin: 0; padding: 0.5em 2em; }</style></head><body>
<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database: <span dir=ltr>Can't connect to MySQL server on '
127.0.0.1' (111) (127.0.0.1:3306)</span>)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
        <input type="hidden" name="domains" value="https://translatewiki.net" />
        <input type="hidden" name="num" value="50" />
        <input type="hidden" name="ie" value="UTF-8" />
        <input type="hidden" name="oe" value="UTF-8" />

        <input type="text" name="q" size="31" maxlength="255" value="" />
        <input type="submit" name="btnG" value="Search" />
        <p>
                <label><input type="radio" name="sitesearch" value="https://translatewiki.net" checked="checked" />translatewiki.net</label>
                <label><input type="radio" name="sitesearch" value="" />WWW</label>
        </p>
</form></body></html>
<!DOCTYPE html>
<html><head><title>Internal error - translatewiki.net</title><style>body { font-family: sans-serif; margin: 0; padding: 0.5em 2em; }</style></head><body>
<p>[2bb61ce83432259b8f7dbed4] [no req]   JobQueueConnectionError from line 742 of /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobQueueDB.php: DBConnectionError:DB connection error: Can'
t connect to MySQL server on '127.0.0.1' (111) (127.0.0.1:3306)</p><p>Backtrace:</p><p>#0 /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobQueueDB.php(595): JobQueueDB-&gt;getSlaveDB()<br
 />
#1 /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobQueue.php(640): JobQueueDB-&gt;doGetSiblingQueuesWithJobs(array)<br />
#2 /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobQueueGroup.php(313): JobQueue-&gt;getSiblingQueuesWithJobs(array)<br />
#3 /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobQueueGroup.php(197): JobQueueGroup-&gt;getQueuesWithJobs()<br />
#4 /srv/mediawiki/tags/2016-03-31_19:10:56/includes/jobqueue/JobRunner.php(154): JobQueueGroup-&gt;pop(integer, integer, array)<br />
#5 /srv/mediawiki/tags/2016-03-31_19:10:56/maintenance/runJobs.php(93): JobRunner-&gt;run(array)<br />
#6 /srv/mediawiki/tags/2016-03-31_19:10:56/maintenance/doMaintenance.php(111): RunJobs-&gt;execute()<br />
#7 /srv/mediawiki/tags/2016-03-31_19:10:56/maintenance/runJobs.php(127): include(string)<br />
#8 {main}</p>

sudo service mw-jobrunner status
mw-jobrunner stop/waiting

It is not shown in service --status-all at all.

There is nothing in the logs. Is upstart really not logging service failures anywhere?

Anyway, started jobrunner manually. Looks like we need some kind of monitoring to prevent this happening in the future.
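As a rough idea of what such monitoring could look at, the sketch below interprets a status line like the one printed above. It is not the check from any actual patch. The function name is hypothetical, and the "start/running, process <pid>" shape for a healthy job is assumed from the usual Upstart status output.

```python
def upstart_job_is_running(status_line):
    """Return True if an Upstart status line reports the job as start/running."""
    parts = status_line.split()
    # Assumed shape: "<job> <goal>/<state>[, process <pid>]"
    if len(parts) < 2:
        return False
    return parts[1].rstrip(",") == "start/running"


print(upstart_job_is_running("mw-jobrunner stop/waiting"))
# False
```

A monitoring script could feed the output of the status command into this and alert whenever it returns False.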

Change 282485 had a related patch set uploaded (by Nikerabbit):
Add stupidly simple check to alert if JobQueue is not running

https://gerrit.wikimedia.org/r/282485

Change 282485 merged by jenkins-bot:
Add stupidly simple check to alert if JobQueue is not running

https://gerrit.wikimedia.org/r/282485

Nikerabbit claimed this task.
Nikerabbit triaged this task as Medium priority.
Nikerabbit removed a project: Patch-For-Review.

This specific issue should not happen again. The general issue still exists but has a separate task.

Tests are broken again today, this time because userjsispublic and usercssispublic have gone missing in rMWb3108317000f: Localisation updates from https://translatewiki.net.

I did not find any clues as to what could have caused it this time. Can we close this again?