
GWT infinite loops with wmf's new hhvm job runners
Closed, Resolved · Public

Description

GWT seems to be currently unable to download any images. The job runner goes wild; see the logs with thousands of identical entries:
https://commons.wikimedia.org/w/index.php?title=Special:Log&offset=&limit=5000&type=gwtoolset&user=

(...)
18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.
18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.
18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.
18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.
18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.
(...)

Event Timeline

Kelson raised the priority of this task from to High.
Kelson updated the task description. (Show Details)
Kelson added a subscriber: Kelson.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Can other GWT users confirm this, or is the bug limited to that one job?

tgr@terbium:/usr/local/apache/common/php-1.25wmf14$ mwscript showJobs.php --wiki=commonswiki --group | grep -i gwt
gwtoolsetUploadMetadataJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
tgr@tin:~$ redis-cli -a $REDISPWD -h rdb1003 hget commonswiki:jobqueue:gwtoolsetUploadMetadataJob:h-data `redis-cli -a $REDISPWD -h rdb1003 lrange commonswiki:jobqueue:gwtoolsetUploadMetadataJob:l-unclaimed 0 10 | cut -d\" -f2`

unserialized result:

array(8) {
  ["type"]=>
  string(26) "gwtoolsetUploadMetadataJob"
  ["namespace"]=>
  int(2)
  ["title"]=>
  string(65) "Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba9e3d0bb53"
  ["params"]=>
  array(3) {
    ["attempts"]=>
    int(1)
    ["user-name"]=>
    string(22) "Swiss National Library"
    ["whitelisted-post"]=>
    array(30) {
      ["gwtoolset-category"]=>
      array(2) {
        [0]=>
        string(20) "Collection Gugelmann"
        [1]=>
        string(47) "Media contributed by the Swiss National Library"
      }
      ["gwtoolset-category-phrase"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
      ["gwtoolset-category-metadata"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
      ["gwtoolset-form"]=>
      string(16) "metadata-preview"
      ["gwtoolset-mediafile-throttle"]=>
      string(1) "5"
      ["gwtoolset-mediawiki-template-name"]=>
      string(11) "Information"
      ["gwtoolset-metadata-file-relative-path"]=>
      string(42) "4/c/c/4cc9e2068ff5d1796253d2622b5d83b2.xml"
      ["gwtoolset-metadata-file-sha1"]=>
      string(31) "okb09obkpp6i59a1ma66h8bgtcmopj4"
      ["gwtoolset-metadata-file-url"]=>
      string(0) ""
      ["gwtoolset-metadata-mapping-name"]=>
      string(0) ""
      ["gwtoolset-metadata-mapping-subpage"]=>
      string(17) "Metadata_Mappings"
      ["gwtoolset-metadata-mapping-url"]=>
      string(0) ""
      ["gwtoolset-metadata-namespace"]=>
      string(10) "GWToolset:"
      ["gwtoolset-partner-template-url"]=>
      string(0) ""
      ["gwtoolset-record-begin"]=>
      string(1) "1"
      ["gwtoolset-record-count"]=>
      string(3) "782"
      ["gwtoolset-record-element-name"]=>
      string(6) "record"
      ["gwtoolset-global-license"]=>
      string(0) ""
      ["wpEditToken"]=>
      string(42) "<redacted>"
      ["wpSummary"]=>
      string(0) ""
      ["author"]=>
      array(1) {
        [0]=>
        string(17) "MalerInZeichnerIn"
      }
      ["date"]=>
      array(1) {
        [0]=>
        string(19) "Entstehungszeitraum"
      }
      ["description"]=>
      array(4) {
        [0]=>
        string(9) "TitelName"
        [1]=>
        string(16) "Kurzbeschreibung"
        [2]=>
        string(3) "Ort"
        [3]=>
        string(7) "Technik"
      }
      ["other_fields"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
      ["other_fields_1"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
      ["other_versions"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
      ["permission"]=>
      array(1) {
        [0]=>
        string(10) "permission"
      }
      ["source"]=>
      array(2) {
        [0]=>
        string(6) "Source"
        [1]=>
        string(3) "URL"
      }
      ["gwtoolset-title"]=>
      array(1) {
        [0]=>
        string(14) "GWToolsetTitle"
      }
      ["gwtoolset-url-to-the-media-file"]=>
      array(1) {
        [0]=>
        string(10) "ToolsetURL"
      }
    }
  }
  ["rtimestamp"]=>
  int(0)
  ["uuid"]=>
  string(32) "dd7e6233802e49bda86ec1173897a8f6"
  ["sha1"]=>
  string(0) ""
  ["timestamp"]=>
  int(1421516349)
}

I myself am not able to upload pictures using GWT; just like for the user "Swiss_National_Library", my upload does not occur, see:
https://commons.wikimedia.org/w/index.php?title=Special:Log&offset=&limit=500&type=gwtoolset&user=

Stopped the job. If this reoccurs, run

redis-cli -a $REDISPWD -h rdb1003 < /home/tgr/redis-gwtoolset-clear.txt

a few times (see here for how to get the password).

Not sure if that was the proper way to kill a job, but it's Saturday, the ops channel is near empty, and it was producing half a million log entries per day and swamping RC on Commons.

runJobs.log snippet:

2015-01-17 07:09:13 mw1003 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99b063b attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=d4658ce162854eb5a5be5dbaef15ce8d,QueuePartition=rdb2) STARTING
2015-01-17 07:09:13 mw1014 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99dfcaf attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=ddc3c552ac01427e881d9a2d25e64df4,QueuePartition=rdb2) STARTING
2015-01-17 07:09:13 mw1003 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99b063b attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=d4658ce162854eb5a5be5dbaef15ce8d,QueuePartition=rdb2) t=44 good
2015-01-17 07:09:13 mw1002 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99e7cab attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=a5d2f1db087f466b9585605156641433,QueuePartition=rdb1) STARTING
2015-01-17 07:09:13 mw1014 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99dfcaf attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=ddc3c552ac01427e881d9a2d25e64df4,QueuePartition=rdb2) t=49 good
2015-01-17 07:09:14 mw1002 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99e7cab attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=a5d2f1db087f466b9585605156641433,QueuePartition=rdb1) t=42 good
2015-01-17 07:09:14 mw1001 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99f05c8 attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=ce6f1321a3e74dc5a0bab85f14028c30,QueuePartition=rdb2) STARTING
2015-01-17 07:09:14 mw1001 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a99f05c8 attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=ce6f1321a3e74dc5a0bab85f14028c30,QueuePartition=rdb2) t=49 good
2015-01-17 07:09:14 mw1006 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a9a0da01 attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=a08a513c37d543cca2ce6a23ff1ee4e4,QueuePartition=rdb1) STARTING
2015-01-17 07:09:14 mw1006 commonswiki: gwtoolsetUploadMetadataJob User:Swiss_National_Library/GWToolset/Metadata_Batch_Job/54ba0a9a0da01 attempts=1 user-name=Swiss National Library whitelisted-post=array(30) (uuid=a08a513c37d543cca2ce6a23ff1ee4e4,QueuePartition=rdb1) t=56 good

The job IDs keep changing, so I assume this means that the job is continuously recreated. Based on the log, the job executes successfully, so JobRunner is not the one recreating it. There are two places in GWT where new UploadMetadataJob objects are created: UploadMetadataJob::recreateMetadataJob() and MetadataMappingHandler::createMetadataBatchJob().

  • recreateMetadataJob is only invoked when the number of UploadMediafileJobs in the queue is over 1000 - there were no such jobs, so this shouldn't have happened
  • createMetadataBatchJob can be called from MetadataMappingHandler::processRequest(), but only in web mode (when the user is firing off the batch on the special page)
  • so the only possible code path through which this job rescheduling loop could have happened is UploadMediafileJob::run() -> UploadMediafileJob::processMetadata() -> MetadataMappingHandler::processRequest() -> MetadataMappingHandler::processMetadata() -> MetadataMappingHandler::createMetadataBatchJob().

The way this is supposed to work: MetadataMappingHandler::processMetadata() calls XmlMappingHandler::processXml(), which invokes XmlMappingHandler::processDOMElements() for each matching XML element (starting from position gwtoolset-record-begin, up to gwtoolset-mediafile-throttle times). That function increases gwtoolset-record-begin so that those elements are not re-read, and the job then reschedules itself with some delay. Apparently incrementing gwtoolset-record-begin fails somehow, and the job continuously reschedules itself to start from position 0 (it also spawns no uploaders).

In the dump above, gwtoolset-mediafile-throttle is 5, gwtoolset-record-begin is 1 and gwtoolset-record-count is 782.
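The intended batching behaviour described above can be sketched as a minimal simulation (Python, purely illustrative; the real logic lives in the PHP classes named above, and the parameter names mirror the job dump):

```python
def run_batches(record_begin, record_count, throttle, persist_increment=True):
    """Simulate GWToolset's metadata batch rescheduling.

    Each run processes up to `throttle` records starting at `record_begin`,
    then reschedules itself with the incremented offset. If the increment is
    never persisted (the suspected bug), the job requeues itself from the
    same position forever.
    """
    runs = 0
    max_runs = 1000  # safety valve so the buggy case terminates here
    while record_begin <= record_count and runs < max_runs:
        runs += 1
        if persist_increment:
            record_begin += throttle  # next job starts past this batch
        # else: record_begin stays put -> infinite rescheduling loop
    return runs

# Values from the job dump: begin=1, count=782, throttle=5.
print(run_batches(1, 782, 5))                          # finishes in 157 runs
print(run_batches(1, 782, 5, persist_increment=False)) # hits the safety valve
```

With a working increment the batch terminates after 157 runs; without it, the only bound on the loop is whatever kills the job from outside — which matches the half-million log entries per day seen here.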

I thought I had figured out how GWT generates the storage path for the metadata file, but

FileBackendGroup::singleton()->get('local-multiwrite')->fileExists(array('src'=>'mwstore://local-multiwrite/gwtoolset-metadata/Swiss_National_Library/4/c/c/4cc9e2068ff5d1796253d2622b5d83b2.xml'))

returns false on Commons.
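For what it's worth, the relative path in the job dump looks like a three-level directory fan-out built from the first characters of the hash-style filename. This is a guess inferred from the dump itself, not confirmed against the GWToolset source:

```python
def fanout_path(filename):
    """Build a 3-level directory fan-out from the first three characters of
    a hash-style filename, matching the layout seen in the job dump above.
    Inferred sketch only -- not the actual GWToolset implementation."""
    stem = filename.split('.')[0]
    return f"{stem[0]}/{stem[1]}/{stem[2]}/{filename}"

# Reproduces the gwtoolset-metadata-file-relative-path from the dump:
print(fanout_path("4cc9e2068ff5d1796253d2622b5d83b2.xml"))
# 4/c/c/4cc9e2068ff5d1796253d2622b5d83b2.xml
```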

@Kelson, can you share the XML file you used? At this point the simplest solution seems to debug in a dev wiki how GWT reads it - the problem appears to be that it cannot find the XML nodes for the files.

We used the following file

@Tgr,

is the constant PHP_SAPI defined in production?
when a job is run by the job runner, is that constant set to 'cli'?

lines 441 and 565 of GWToolset/includes/Handlers/Forms/MetadataMappingHandler.php depend on that constant being available and set to 'cli' or not.

i was able to run the upload locally without issue, but i'm not running hhvm. i haven't been able to re-visit the vagrant set-up to see if it works with gwtoolset yet, so i'm hoping you might be able to check the production environment to see what value PHP_SAPI is set to. the solution might just be to add to the PHP_SAPI test, maybe PHP_SAPI === 'hhvm' ...
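the suspected branching can be mocked like this (a Python sketch of the hypothesis, not the real PHP in MetadataMappingHandler.php; the function name and return values are illustrative only):

```python
def process_request(sapi):
    """Mock of the suspected PHP_SAPI check. The hypothesis: GWToolset treats
    PHP_SAPI == 'cli' as "running inside a background job"; any other value
    takes the web path, which emits the Step 4 HTML message and creates a
    fresh metadata batch job -- i.e. the loop seen in the log."""
    if sapi == 'cli':
        return 'process-next-batch'          # intended job-runner path
    return 'emit-html-and-create-batch-job'  # web path: requeues the job

# if the hhvm job runners reported anything other than 'cli', every job run
# would take the web path and spawn a replacement job:
print(process_request('cli'))
print(process_request('srv'))
```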

tgr@terbium:~$ cd /usr/local/apache/common/php-1.25wmf14
tgr@terbium:/usr/local/apache/common/php-1.25wmf14$ mwscript eval.php commonswiki
> echo PHP_SAPI;
cli

Not sure how realistic that is or how well it matches the job runners; I'll look that up tomorrow.

The timing would make sense, HHVM was rolled out to the job runners last week. The docs claim it should be cli, though.

Well, assuming it's this straightforward:

krenair@mw1001:~$ php -a
Welcome to HipHop Debugger!
Type "help" or "?" for a complete list of commands.

Note: no server specified, debugging local scripts only.
If you want to connect to a server, launch with "-h" or use:
  [m]achine [c]onnect <servername>

hphpd> var_dump( PHP_SAPI );
var_dump( PHP_SAPI );
string(3) "cli"

Looks fine...

@Krenair,
thanks for checking it.

@Tgr,
thanks as well. i'll try and replicate the issue locally as soon as i can get some free time. the main issue i see at the moment is the log entry itself. that entry shouldn’t appear in the log. it’s the html that appears after successful creation of the initial metadata batch job and should only appear in the browser for the user, not as part of any job runs.

18:38, 16 January 2015 Swiss National Library (talk | contribs) metadata job created. <h2>Step 4: Batch upload</h2>Metadata batch job created. Your metadata file will be analyzed shortly and each item will be uploaded to the wiki in a background process. You can check the page "<a href="/wiki/Special:NewFiles" title="Special:NewFiles" target="_blank">Special:NewFiles</a>" to see when they have been uploaded. Started with metadata record 1.

that entry shouldn’t appear

Thanks for posting here @Steinsplitter. I've run the command in T87040#984282 again a few times to stop that one for now.

gerritbot added a subscriber: gerritbot.

Change 186237 had a related patch set uploaded (by Dan-nl):
GWToolset Special:Log Issue

https://gerrit.wikimedia.org/r/186237

Patch-For-Review

Change 186241 had a related patch set uploaded (by Brian Wolff):
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186241

Patch-For-Review

Change 186237 abandoned by Dan-nl:
GWToolset Special:Log Issue

Reason:
it looks like the script can't rely on PHP_SAPI. bawolff has another solution that works: https://gerrit.wikimedia.org/r/#/c/186241

https://gerrit.wikimedia.org/r/186237

Change 186241 merged by jenkins-bot:
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186241

@Kelson,
once https://gerrit.wikimedia.org/r/#/c/186241/ is deployed to production you can try the upload again. i suggest using a small subset to start and then the rest once you're able to verify that all is well.

Thank you all for doing your best to fix this issue quickly. I'll try again with a set of 3 pictures ASAP (after the rollout).

I tested it this evening before reading this thread, and I'm sorry but I uploaded a set of 100 elements.
The problem seems to be the same as described before the patch; look at https://commons.wikimedia.org/w/index.php?title=Special%3ALog&type=gwtoolset&user=Tounoki&page=&year=&month=-1&tagfilter=

More details here: https://lists.wikimedia.org/pipermail/glamtools/2015-January/000326.html

The log shows entries roughly every 31 minutes, since about 18:00. I ran the command again to stop it.

Note the patch is only live as of 18:08 UTC. Did you test before or after that?

The job queue could potentially be backed up from the previous issue, or it might just still be broken (the patch was a guess; we couldn't confirm for sure that that was the full issue).

I'll note the infinite-loop thing isn't happening currently, so that's good at least.

Live? The patch is not deployed to WMF servers yet...

Oh whoops, I misread the bug number in the server admin log. Please ignore me :)

When do you think I could run a new test?

We tried to reproduce the issue on deployment-prep before the patch was approved (so that we could check that the patch really fixed it). Unfortunately the job was not getting run (we didn't know why), so the patch was approved anyway.

We could check it in production, I guess; it shouldn't break things any more than they already are. I'd want @greg to approve deploying it, though.

@Tounoki,
you should be able to run a small test on beta first. if your test works on beta, then “in theory” it should work on production once GWToolset 0.3.8 (1fd5c4d) 16:55, 22 January 2015 is deployed there. you can check this page in production and look for GWToolset to see which version is on production.

@Krenair
the next deployment train for commons is group 1 on Tuesday, 03 February 2015. we definitely want to get https://gerrit.wikimedia.org/r/#/c/186241/ deployed before then if possible.

the next SWAT is on Monday, January 26 00:00–23:00 UTC, (Sun) 16:00–15:00 PST 01:00–00:00 UTC+1. the patch meets the SWAT criteria “Bugfixes/backports (in other words: Things that are already in master/deployed on the Beta Cluster)”. i don’t know yet how to “Prepare patches in gerrit against the current live wmfNN branches (or a subset if the bug is limited)”. do you or @Bawolff?

Assuming the procedure hasn't changed: generally you cherry-pick the patch to the current wmf branch, have someone with deployment rights +2 that, then create a patch against the wmf branch of MediaWiki core updating the submodule reference.


I wonder if gwtoolset should be disabled until this patch can be deployed.

Bawolff renamed this task from GWT seems to be currently broken to GWT infinite loops with wmf's new hhvm job runners. · Jan 23 2015, 12:12 PM

@Bawolff,
would you mind setting that cherry picked merge up?

i sent an email to the glamtools mailing list letting everyone know about the issue and asked them not to use GWToolset on production until this issue has been resolved. hopefully that will be enough.

@Bawolff will you be in town for the Dev Summit? I'd like to fix this as soon as possible. I haven't had a chance to look into the fix needed, but if it can be easily backported to the live MW versions (wmf13/14) then we should deploy it ASAP.

Change 186369 had a related patch set uploaded (by Brian Wolff):
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186369

Patch-For-Review

Change 186370 had a related patch set uploaded (by Brian Wolff):
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186370

Patch-For-Review

:( :(

@Gilles, can you shepherd this out? Do you agree it should go out early next week in a SWAT?

Yes, this seems serious enough to do on the 8am SWAT on Monday. I'll add it to the Deployments page.

Change 186369 merged by Gilles:
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186369

Change 186370 merged by jenkins-bot:
Do not assume that PHP_SAPI is cli when running jobs

https://gerrit.wikimedia.org/r/186370

Change 186385 had a related patch set uploaded (by Gilles):
Backport GWToolset HHVM bugfix

https://gerrit.wikimedia.org/r/186385

Patch-For-Review

Change 186388 had a related patch set uploaded (by Gilles):
Backport GWToolset HHVM bugfix

https://gerrit.wikimedia.org/r/186388

Patch-For-Review

i don’t know yet how to “Prepare patches in gerrit against the current live wmfNN branches (or a subset if the bug is limited)”

For the record: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension_changes

Change 186388 merged by jenkins-bot:
Backport GWToolset HHVM bugfix

https://gerrit.wikimedia.org/r/186388

Change 186385 merged by jenkins-bot:
Backport GWToolset HHVM bugfix

https://gerrit.wikimedia.org/r/186385

This is now deployed, please test it on Commons if you have the rights to do so.

Please go ahead with a complete set and let us know how it goes.

Looking at the log, the big job flood issue here seems resolved. There's still some weirdness with HTML in the logs, but perhaps we should open a new task for that?

I uploaded a set of 500 files. Work in progress, but it seems to be OK.

MarkTraceur claimed this task.

Great work, team! Sorry it took so long.