Page MenuHomePhabricator

Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes
Closed, ResolvedPublic

Description

Incident report for some summary https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki

Summary

Some jobs such as account creation, account rename or Wikidata dispatcher invokes SiteConfiguration::getConfig() since they have to act on several wikis. That method shells out to mwscript maintenance/getConfiguration.php with ulimits being applied, most notably a file size limit of 512MBytes.

However, when HHVM runs the command, it tries to update the byte code cache (either /var/cache/hhvm/fcgi.sq3 or /var/cache/hhvm/cli.sq3). That causes a system error EFBIG (File too large) and the job fail.

See https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki incident report for actionables.

See below comments for debugging / technical details.

Original task

After pushing 1.28.0-wmf.19 to group1 (which includes wikidatawiki) the dispatch seems to have stopped entirely.

https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script

https://www.wikidata.org/wiki/Special:DispatchStats with wikidata on wmf.19 had:

Change log statistics

IDTimestamp
Oldest37378639419:45, 12 September 2016
Newest37449819219:48, 15 September 2016

Dispatch statistics

SitePositionPendingLagTouched
Freshesteswikibooks3744950083,18430 minutes19:18, 15 September 2016
Medianskwiki3744949273,26530 minutes19:17, 15 September 2016
Stalestfiwiki3744936394,55337 minutes19:48, 15 September 2016
Average--3,30430 minutes-

I reverted wikidata back to .18 and the dispatch got handled just fine. Stats view:

Change log statistics

IDTimestamp
Oldest37378976120:00, 12 September 2016
Newest37450096220:08, 15 September 2016

Dispatch statistics

SitePositionPendingLagTouched
Freshesttywiki374500941210 minutes20:08, 15 September 2016
Medianextwiki374500920420 minutes20:08, 15 September 2016
Stalestbnwikisource374500917450 minutes20:08, 15 September 2016
Average--340 minutes-

I have noticed:

 {
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 422,
  "function": "getConfig",
  "class": "SiteConfiguration",
  "type": "->",
  "args": [
    "string",
    "string"
  ]
},
{
  "function": "{closure}",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "boolean",
    "integer",
    "array",
    "NULL"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/libs/objectcache/WANObjectCache.php",
  "line": 987,
  "function": "call_user_func_array",
  "args": [
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/libs/objectcache/WANObjectCache.php",
  "line": 892,
  "function": "doGetWithSetCallback",
  "class": "WANObjectCache",
  "type": "->",
  "args": [
    "string",
    "integer",
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 425,
  "function": "getWithSetCallback",
  "class": "WANObjectCache",
  "type": "->",
  "args": [
    "string",
    "integer",
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 293,
  "function": "getCachedConfigVar",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "string"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 304,
  "function": "getQueueTypes",
  "class": "JobQueueGroup",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 152,
  "function": "getDefaultQueueTypes",
  "class": "JobQueueGroup",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/includes/Notifications/JobQueueChangeNotificationSender.php",
  "line": 59,
  "function": "push",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "Wikibase\\ChangeNotificationJob"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/includes/ChangeDispatcher.php",
  "line": 248,
  "function": "sendNotification",
  "class": "Wikibase\\Repo\\Notifications\\JobQueueChangeNotificationSender",
  "type": "->",
  "args": [
    "string",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php",
  "line": 210,
  "function": "dispatchTo",
  "class": "Wikibase\\Repo\\ChangeDispatcher",
  "type": "->",
  "args": [
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/maintenance/doMaintenance.php",
  "line": 110,
  "function": "execute",
  "class": "Wikibase\\DispatchChanges",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php",
  "line": 271,
  "args": [
    "string"
  ],
  "function": "require_once"
},
{
  "file": "/srv/mediawiki/multiversion/MWScript.php",
  "line": 97,
  "args": [
    "string"
  ],
  "function": "require_once"
}

Details

Related Gerrit Patches:
mediawiki/core : masterDo not limit filesize when running a maintenance script
operations/mediawiki-config : masterInline doc for $wgMaxShell*
mediawiki/core : masterAvoid triggering SiteConfiguration lookup in JobQueueGroup::push()
mediawiki/core : wmf/1.28.0-wmf.19Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()
operations/mediawiki-config : masterAll wikis back to 1.28.0-wmf.18
operations/mediawiki-config : masterFix wikidata to .18 (previous was testwikidata)
operations/mediawiki-config : masterAll wiki but wikidatawiki to php-1.28.0-wmf.19
operations/mediawiki-config : masterwikidatawiki back to 1.28.0-wmf.18

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hashar added a comment.EditedSep 16 2016, 9:35 AM

Reproduction on terbium invoking the command against dewikinews:

$ sudo -u www-data /bin/bash /srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki dewikinews --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180'; echo "Exit code: $?"
[Fri Sep 16 09:35:03 2016] [hphp] [7563:7f494d3c4100:0:000001] [] Lost parent, LightProcess exiting
/usr/bin/timeout: the monitored command dumped core
[Fri Sep 16 09:35:03 2016] [hphp] [7564:7f494d3c4100:0:000001] [] Lost parent, LightProcess exiting
/srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh: line 101:  7556 File size limit exceeded/usr/bin/timeout $MW_WALL_CLOCK_LIMIT /bin/bash -c "$1" 3>&-

Exit code: 153

It works just fine with MW_FILE_SIZE_LIMIT=1048576

2016-09-15 19:59:54 [139a37a9f227e562aefeee8d] terbium wikidatawiki 1.28.0-wmf.19 exception ERROR: [139a37a9f227e562aefeee8d] [no req]   MWException from line 561 of /srv/mediawiki/php-1.28.0-wmf.19/includes/SiteConfiguration.php: Failed to run getConfiguration.php. {"exception_id":"139a37a9f227e562aefeee8d"}

Maybe T145839 is related:

[V9vCrwpEFhUAADfGIToAAAAC] /wiki/Special:CreateAccount MWException from line 561 of /srv/mediawiki/php-master/includes/SiteConfiguration.php: Failed to run getConfiguration.php.
...

And strace is again my friend. Running it with:

  • -y to show file descriptor names
  • -e trace=desc for system calls related to file descriptor
  • -s 2048 for large strings
[pid 25957] open("/var/cache/hhvm/cli.hhbc.sq3", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 7
[pid 25957] open("/var/cache/hhvm/cli.hhbc.sq3-journal", O_RDWR|O_CLOEXEC) = 8

That is sqlite journaling.

[pid 25957] read(8</var/cache/hhvm/cli.hhbc.sq3-journal>, ..., 1024 ) = 1024
[pid 25957] lseek(7</var/cache/hhvm/cli.hhbc.sq3>, 844421120, SEEK_SET) = 844421120
[pid 25957] write(7</var/cache/hhvm/cli.hhbc.sq3>, ... , 1024 ) = -1 EFBIG (File too large)
[pid 25957] --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=25957, si_uid=33} ---

[pid 25957] +++ killed by SIGXFSZ (core dumped) +++
[pid 25954] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=25957, si_status=SIGXFSZ, si_utime=3, si_stime=22} ---

Command for copy pasting:
sudo -u www-data strace -y -e trace=desc -s2048 -f /bin/bash /srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki dewikinews --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180'; echo "Exit code: $?"

Without strace but with MW_INCLUDE_STDERR=1

/srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh: line 101: 24089 File size limit exceeded/usr/bin/timeout $MW_WALL_CLOCK_LIMIT /bin/bash -c "$1" 3>&-

TLDR: when running mwscript on terbium the HHVM cache file /var/cache/hhvm/cli.hhbc.sq3 is committed to. Sqlite3 then use the journaling system to rewrite it and that explodes due to the ulimit -f.

hashar edited projects, added HHVM; removed Patch-For-Review.Sep 16 2016, 10:17 AM

SiteConfiguration::getConfig() uses wfShellWikiCmd() to craft the command it uses the PHP interpreter from $wgPhpCli, that is '/usr/bin/php' which on terbium is hhvm.

549             $retVal = 1;
550             $cmd = wfShellWikiCmd(
551                 "$IP/maintenance/getConfiguration.php",
552                 [
553                     '--wiki', $wiki,
554                     '--settings', implode( ' ', $settings ),
555                     '--format', 'PHP'
556                 ]
557             );

Then the call is made. Note the comment about ulimit5.sh breaking the call !!! The memory limit is set to zero explicitly :(

558             // ulimit5.sh breaks this call
559             $data = trim( wfShellExec( $cmd, $retVal, [], [ 'memory' => 0 ] ) );
560             if ( $retVal != 0 || !strlen( $data ) ) {
561                 throw new MWException( "Failed to run getConfiguration.php." );
562             }
563             $res = unserialize( $data );
564             if ( !is_array( $res ) ) {
565                 throw new MWException( "Failed to unserialize configuration array." );
566             }
567             $this->cfgCache[$wiki] = $this->cfgCache[$wiki] + $res;

A monkey patch would be to pass to wfShellExec the env 'filesize' => 0 to workaround the write to the HHVM cache file.

Unsolved

Why on 1.28.0.wmf-18 /var/cache/hhvm/cli.hhbc.sq3 is not updated/written to but it is on 1.28.0.wmf-19.

On the Beta-Cluster-Infrastructure

Trying with the ulimit at 512k

sudo -u www-data /bin/bash /srv/mediawiki/php-master/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki wikidatawiki --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180;'; echo "Exit code: $?"

OK deployment-mediawiki05, Jessie, HHVM 3.12.7+dfsg-1+wmf1
OK deployment-tin, Trusty, HHVM 3.12.1+dfsg-1~wmf2+trusty0

Upgraded HHVM on deployment-tin to 3.12.7+dfsg-1+wmf1~trusty1. It still write to the journal file and run fine.

I am trying in production now :(

OK mw1273, Jessie, 3.12.7+dfsg-1+wmf1

BAD terbium, Trusty, 3.12.7+dfsg-1+wmf1~trusty1

Running:

sudo -u www-data /bin/bash /srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki dewikinews --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180'; echo "Exit code: $?"

WORKS on tin
FAIL on terbium

On terbium I then delete /var/cache/hhvm/cli.hhbc.sq3 and the command now pass.

On terbium cli.hhbc.sq3 was 876 MBytes. It is now down to 17 MBytes.

Change 311118 had a related patch set uploaded (by Hashar):
Inline doc for $wgMaxShell*

https://gerrit.wikimedia.org/r/311118

Got a Failed to run getConfiguration.php. on mw1215.eqiad.wmnet:

$ ls -hl /var/cache/hhvm/
total 2.5G
-rw-r--r-- 1 www-data www-data 202M Sep 16 12:28 cli.hhbc.sq3
-rw-r--r-- 1 www-data www-data 2.3G Sep 16 04:10 fcgi.hhbc.sq3

That is from an api.php call. If the file limit stands true, cli.hhbc.sq3 is only 202Mbytes so I am not sure why that fails.

Mentioned in SAL (#wikimedia-operations) [2016-09-16T12:40:02Z] <hashar> Going to rollback all Wikis back to 1.28.0-wmf.18 . Despite much investigation, a bunch of jobs are broken due to T145819 which includes Special:CreateAccount :(

Change 311120 had a related patch set uploaded (by Hashar):
All wikis back to 1.28.0-wmf.18

https://gerrit.wikimedia.org/r/311120

Change 311120 merged by jenkins-bot:
All wikis back to 1.28.0-wmf.18

https://gerrit.wikimedia.org/r/311120

Mentioned in SAL (#wikimedia-operations) [2016-09-16T12:50:09Z] <hashar@tin> rebuilt wikiversions.php and synchronized wikiversions files: All wikis back to 1.28.0-wmf.18 :( T145819

From T145839#2643508 (task about Special:CreateAccount broken:


Sorry I took time to notice this bug was about Special:CreateAccount being broken. I though it was just yet a random job failling due to T145819. So I kept investigating that other task.

Eventually I have browsed the dashboard for authentication metrics at https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen Screenshot of the last 24 hours shows a 100% error rate on account creation:

I have reverted all wikis to 1.28.0-wmf.18 as a result which is reflected in the drop of errors at the far right of the graph above.

Sorry, should really have noticed that earlier.


All wikis are now back to 1.28.0-wmf.18. The Wikidata dispatch lag is all fine Account creation jobs are back as well.

I am enlarging the scope of this task. It is not just Wikidata but more about a variety of jobs that fails whenever shelling out to mwscript maintenance/getConfiguration.php.

Incident report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki

(Huge thanks to @Addshore for the support/pointers etc)

hashar renamed this task from Wikidata at 1.28.0-wmf.19 no more replicate to wikis (replag raise / dispatch stop) to Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes.Sep 16 2016, 2:55 PM
hashar removed a project: Patch-For-Review.
hashar updated the task description. (Show Details)

It looks like rMW6f9a246d25b2: Make JobQueueGroup::push() update the queuesHaveJobs() cache is what made job pushing start triggering T111441. It'll affect any code that pushes a job onto another wiki's job queue. @aaron might know if we could revert that patch until T111441 can be fixed, or if we can adjust the added code to only run if the job is being pushed on the local wiki.

Change 311168 had a related patch set uploaded (by Aaron Schulz):
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311168

Addshore moved this task from incoming to monitoring on the Wikidata board.Sep 16 2016, 6:17 PM

Change 311172 had a related patch set uploaded (by Aaron Schulz):
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311172

Change 311172 merged by jenkins-bot:
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311172

Change 311168 merged by jenkins-bot:
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311168

hashar removed hashar as the assignee of this task.Sep 16 2016, 7:46 PM

Change 311118 merged by jenkins-bot:
Inline doc for $wgMaxShell*

https://gerrit.wikimedia.org/r/311118

A view of messages referencing /srv/mediawiki/php-1.28.0-wmf.19/includes/SiteConfiguration.php as referenced by @Addshore previously in https://logstash.wikimedia.org/goto/259821dc32242eb3fde0cd02755685f6

Fields:

  • server: name of the server. codfw / snapshot hosts have been removed (had 0 errors)
  • n_err number of lines in logstash
  • cli_size human readable size of /var/cache/hhvm/cli.hhvm.sq3
  • cli_date date of file as reported by ls
  • fcgi_size/fcgi_date same but for /var/cache/hhvm/fcgi.hhvm.sq3

I have starred has that table for a while. Cant find anything specific of interest but maybe someone else will?

Seems like the cgi cache files have been trashed/regenerated on some hosts.

servern_errcli_sizecli_datefcgi_sizefcgi_date
terbium32328MSep 19 12:1020MAug 31 12:24
mw10170139MSep 6 23:421.2GSep 19 11:12
mw10990112MSep 14 23:221.2GSep 19 11:13
mw1152041MAug 31 10:3223MSep 19 11:27
mw116110622MSep 19 06:271.1GSep 19 11:12
mw11626622MSep 19 06:271.1GSep 19 11:32
mw11637221MSep 19 06:271.1GSep 19 11:12
mw11649221MSep 19 06:271.1GSep 19 11:13
mw11658733MSep 19 06:271.1GSep 19 11:13
mw11668921MSep 19 06:271.1GSep 19 11:12
mw116710722MSep 19 06:271.1GSep 19 11:12
mw11687621MSep 19 06:271.1GSep 19 11:12
mw11697722MSep 19 06:271.1GSep 19 11:13
mw1170184206MSep 19 12:392.3GSep 19 11:13
mw1171170277MSep 19 12:396.7GSep 19 11:12
mw1172157208MSep 19 12:392.3GSep 19 12:22
mw1173176210MSep 19 12:392.3GSep 19 11:13
mw1174162284MSep 19 12:396.6GSep 19 11:12
mw1175187195MSep 19 12:382.3GSep 19 11:12
mw1176174193MSep 19 12:392.3GSep 19 11:13
mw1177168202MSep 19 12:392.3GSep 19 11:13
mw1178189204MSep 19 12:392.3GSep 19 11:36
mw1179144199MSep 19 12:392.3GSep 19 11:13
mw1180174206MSep 19 12:392.3GSep 19 11:12
mw1181152206MSep 19 12:392.3GSep 19 11:12
mw1182184202MSep 19 12:392.3GSep 19 11:31
mw1183180202MSep 19 12:392.3GSep 19 11:12
mw1184149209MSep 19 12:392.3GSep 19 11:12
mw1185122212MSep 19 12:392.3GSep 19 11:13
mw1186174206MSep 19 12:392.3GSep 19 11:12
mw1187181209MSep 19 12:392.3GSep 19 11:57
mw1188162202MSep 19 12:392.3GSep 19 11:53
mw1189260Sep 16 12:3384MSep 19 11:13
mw11901417MSep 19 10:4893MSep 19 12:17
mw1191370Sep 19 10:1530MSep 19 12:37
mw11922521MSep 17 16:391.2GSep 19 11:13
mw11931721MSep 17 16:421.2GSep 19 11:13
mw11943462MSep 17 16:414.6GSep 19 12:36
mw11954621MSep 17 16:431.2GSep 19 11:13
mw11962983MSep 17 16:382.1GSep 19 11:12
mw11973373MSep 17 16:444.5GSep 19 11:12
mw11982321MSep 17 16:441.2GSep 19 11:12
mw11992874MSep 17 16:394.6GSep 19 11:12
mw12003633MSep 18 22:241.2GSep 19 12:38
mw12014733MSep 17 16:411.2GSep 19 12:22
mw12023262MSep 17 16:434.4GSep 19 11:12
mw12035321MSep 16 12:091.2GSep 19 11:12
mw12044621MSep 16 11:591.2GSep 19 11:12
mw12052921MSep 16 12:421.2GSep 19 11:13
mw12063833MSep 17 16:501.2GSep 19 11:12
mw12072433MSep 19 10:491.2GSep 19 11:12
mw12082033MSep 17 16:501.2GSep 19 11:13
mw1209184205MSep 19 12:392.3GSep 19 11:12
mw1210162206MSep 19 12:392.3GSep 19 11:13
mw1211182202MSep 19 12:392.3GSep 19 11:12
mw1212174203MSep 19 12:392.3GSep 19 11:12
mw1213191209MSep 19 12:392.3GSep 19 11:12
mw1214146208MSep 19 12:392.3GSep 19 11:12
mw1215186202MSep 19 12:392.3GSep 19 12:25
mw1216202204MSep 19 12:392.3GSep 19 11:12
mw1218166205MSep 19 12:392.3GSep 19 11:12
mw1219170206MSep 19 12:392.3GSep 19 11:13
mw1220162188MSep 19 12:392.2GSep 19 11:12
mw12212821MSep 17 16:511.2GSep 19 11:13
mw12223521MSep 17 16:521.2GSep 19 11:12
mw12232721MSep 17 16:471.2GSep 19 11:12
mw12243117MSep 17 16:531.1GSep 19 11:12
mw12253433MSep 19 10:481.2GSep 19 11:12
mw12263117MSep 17 16:531.1GSep 19 11:12
mw12272321MSep 17 16:481.2GSep 19 12:35
mw12283633MSep 17 16:471.2GSep 19 11:13
mw12292321MSep 17 16:521.2GSep 19 11:24
mw12302517MSep 17 16:591.1GSep 19 11:13
mw12313529MSep 18 19:051.1GSep 19 12:07
mw12324217MSep 17 17:031.1GSep 19 11:27
mw12333617MSep 17 17:041.1GSep 19 11:12
mw12342842MSep 19 10:471.1GSep 19 11:12
mw12353729MSep 17 17:021.1GSep 19 11:12
mw1236173191MSep 19 12:392.2GSep 19 11:13
mw1237170284MSep 19 12:396.6GSep 19 11:12
mw1238156274MSep 19 12:396.6GSep 19 11:12
mw1239212183MSep 19 12:392.2GSep 19 11:13
mw1240196203MSep 19 12:392.3GSep 19 11:12
mw1241183287MSep 19 12:396.8GSep 19 11:13
mw1242170198MSep 19 12:392.3GSep 19 11:12
mw1243186203MSep 19 12:392.3GSep 19 11:13
mw1244168204MSep 19 12:392.3GSep 19 11:12
mw1245172209MSep 19 12:392.3GSep 19 11:12
mw1246153290MSep 19 12:396.7GSep 19 11:12
mw1247194200MSep 19 12:392.3GSep 19 11:13
mw1248186216MSep 19 12:392.3GSep 19 11:13
mw124918717MSep 19 12:3978MSep 19 12:38
mw125014817MSep 19 12:3978MSep 19 12:39
mw125118423MSep 19 12:39106MSep 19 12:34
mw125213824MSep 19 12:39104MSep 19 11:37
mw125315424MSep 19 12:39105MSep 19 12:17
mw125414523MSep 19 12:39104MSep 19 11:13
mw125517817MSep 19 12:3981MSep 19 12:39
mw125622217MSep 19 12:3982MSep 19 12:38
mw125716417MSep 19 12:3982MSep 19 12:38
mw125816417MSep 19 12:3875MSep 19 12:39
mw1259021MSep 19 06:34526MSep 19 12:10
mw1260021MSep 19 06:54531MSep 19 12:10
mw1261286173MSep 19 12:391.2GSep 19 11:12
mw1262292170MSep 19 12:391.2GSep 19 11:12
mw1263248182MSep 19 12:391.2GSep 19 11:12
mw1264262174MSep 19 12:391.2GSep 19 11:12
mw1265266181MSep 19 12:391.2GSep 19 11:12
mw12660116MAug 29 09:441020MSep 19 11:13
mw1267274160MSep 19 12:391.1GSep 19 12:19
mw1268262159MSep 19 12:391.1GSep 19 11:12
mw1269282162MSep 19 12:391.1GSep 19 11:13
mw1270262156MSep 19 12:391.1GSep 19 12:20
mw1271281157MSep 19 12:391.1GSep 19 11:12
mw1272250137MSep 19 12:39958MSep 19 11:12
mw1273298141MSep 19 12:39951MSep 19 11:12
mw1274262135MSep 19 12:39927MSep 19 11:13
mw1275240145MSep 19 12:39936MSep 19 11:13
mw12764117MSep 19 10:48602MSep 19 11:12
mw1277354.6MSep 16 12:49602MSep 19 11:33
mw1278454.4MSep 16 12:08602MSep 19 11:12
mw1279454.5MSep 16 11:55587MSep 19 12:29
mw12804721MSep 16 12:48581MSep 19 11:12
mw1281414.5MSep 16 12:09582MSep 19 11:18
mw1282484.5MSep 16 12:25586MSep 19 11:12
mw12833933MSep 19 10:48582MSep 19 11:13
mw12843318MSep 18 20:11580MSep 19 11:13
mw1285514.6MSep 16 12:29580MSep 19 11:13
mw1286424.4MSep 16 12:12580MSep 19 11:13
mw1287444.5MSep 16 12:17577MSep 19 11:13
mw12884217MSep 19 10:48580MSep 19 11:12
mw1289464.5MSep 16 12:31576MSep 19 11:13
mw12905717MSep 19 10:48501MSep 19 11:12
mw129300Jun 27 14:32360MSep 19 11:12
mw129400Jun 27 14:20362MSep 19 12:14
mw129500Jun 27 15:58360MSep 19 11:12
mw129600Jun 27 15:46361MSep 19 11:12
mw129700Jun 28 09:39354MSep 19 11:13
mw129800Jun 28 10:04349MSep 19 11:12
mw1299768.8MSep 19 06:27474MSep 19 11:13
mw1300978.8MSep 19 06:26446MSep 19 11:12
mw1301838.8MSep 19 06:27446MSep 19 11:12
mw1302978.8MSep 19 06:26411MSep 19 11:12
mw1303798.8MSep 19 06:27446MSep 19 11:33
mw1304728.8MSep 19 06:26449MSep 19 11:12
mw13051168.8MSep 19 06:27410MSep 19 11:12
mw1306828.8MSep 19 06:26412MSep 19 11:12

There seems to be a larger underlying issue (captured in T111441).

Since @aaron and @Anomie merged and back-ported https://gerrit.wikimedia.org/r/311172, is that sufficient to try wmf.19 again this Tuesday? Or should T111441 block the train until it is resolved?

It should be noted, as it may factor in to the decision, that there will be no train next week (2016-09-26) since ops will be at an offsite.

thcipriani raised the priority of this task from Medium to High.Sep 19 2016, 5:59 PM
aaron added a comment.Sep 19 2016, 6:12 PM

The backport should get the maintenance script call rate back to the old status quo (rare), so re-deploy is worth attempting.

T111441 is most probably a beast too large to get rid of in a short time, and I don't even know who could be diverted to work on it.

I would at least clear out the HHVM bytecode caches for to be safe. A server that failed had a 202MBytes file:

Got a Failed to run getConfiguration.php. on mw1215.eqiad.wmnet:

$ ls -hl /var/cache/hhvm/
total 2.5G
-rw-r--r-- 1 www-data www-data 202M Sep 16 12:28 cli.hhbc.sq3
-rw-r--r-- 1 www-data www-data 2.3G Sep 16 04:10 fcgi.hhbc.sq3

That is from an api.php call. If the file limit stands true, cli.hhbc.sq3 is only 202Mbytes so I am not sure why that fails.

The table above is for today and show a lot of servers are around that size if not bigger, thus I highly suspect that will trigger the ulimit again.

Ideally, it would nice to figure out why the bytecode cache has to be written to. My assumption is that we should get it compiled once on each deploy, and mwscript would not have to mess with it.

When proceeding with the deployment, I highly recommend to do the jobrunners one at a time. Watch logstash for it. Note that /var/log/mediawiki/jobrunner.log is only readable by root for now (due to T146040)

The backport should get the maintenance script call rate back to the old status quo (rare), so re-deploy is worth attempting.

Trying the redeploy now.

wmf.19 is on group0 wikis currently.

Plan is to move group1 at 21:45, and all wikis at 22:00 if there is nothing suspicious in the error logs and no reports of any breakage.

greg added a comment.Sep 19 2016, 11:01 PM

We're back to wmf.18 now.

We're back to wmf.18 now.

explanation on the wmf.19 blockers task T143328#2651001 the rollback was not due to the problems outlined in this task; although, I was not able to confirm that this task was resolved.

T145839: Account creation results in fatal MWException was duplicated into this task, and this is still open, so I am confused. Account creation is no longer broken, right?

greg added a comment.Sep 26 2016, 10:30 PM

T145839: Account creation results in fatal MWException was duplicated into this task, and this is still open, so I am confused. Account creation is no longer broken, right?

Account creation is no longer broke, correct.

What are the next steps here, @thcipriani (on duty train conductor, even though we don't have a train this week), @hashar (was on duty for this issue), and @aaron (person who wrote some patches to address at least parts of this)?

aaron added a comment.Sep 27 2016, 9:20 PM

It should be a goal to replace usage of the maintenance config script, but not high priority IMO.

greg lowered the priority of this task from High to Medium.Sep 27 2016, 9:22 PM
matmarex removed a subscriber: matmarex.Sep 28 2016, 7:15 PM

Might be surfacing again from CirrusSearch related web requests (that shell out to mwscript getConfig): T161520

That is still surfacing here and there from time to time.

Maybe we can have Scap to prime the CLI byte code cache ( /var/cache/hhvm/cli.hhbc.sq3 ). This way when HHVM is run in CLI it would not trigger the cache update (and thus reaching the ulimit).

Another way is to have Scap to delete /var/cache/hhvm/cli.hhbc.sq3 on deployment?

greg added a project: Scap.Oct 30 2017, 4:48 PM
greg moved this task from INBOX to Backlog on the Release-Engineering-Team board.

Change 391170 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Do not limit filesize when running a maintenance script

https://gerrit.wikimedia.org/r/391170

Change 391170 merged by jenkins-bot:
[mediawiki/core@master] Do not limit filesize when running a maintenance script

https://gerrit.wikimedia.org/r/391170

tstarling closed this task as Resolved.Dec 8 2017, 4:27 AM
tstarling claimed this task.
tstarling added a subscriber: tstarling.

The fix is merged, and searching logstash for SiteConfiguration shows no further errors of this type.