Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes
Closed, Resolved, Public

Assigned To

Authored By

	hashar
	Sep 15 2016, 7:49 PM

Description

Incident report with a summary: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki

Summary

Some jobs, such as account creation, account rename or the Wikidata dispatcher, invoke SiteConfiguration::getConfig() since they have to act on several wikis. That method shells out to mwscript maintenance/getConfiguration.php with ulimits being applied, most notably a file size limit of 512 MBytes.

However, when HHVM runs the command, it tries to update the bytecode cache (either /var/cache/hhvm/fcgi.hhbc.sq3 or /var/cache/hhvm/cli.hhbc.sq3). That causes a system error EFBIG (File too large) and the job fails.
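The failure mode can be reproduced outside HHVM. The following is a minimal sketch, not taken from the incident: it assumes a Linux-like system, and it uses a scratch file from mktemp and a tiny 8-block limit in place of the real cache file and the 512 MBytes limit.

```shell
# Scratch file standing in for the bytecode cache (hypothetical, from mktemp).
scratch=$(mktemp)

# Run in a subshell so the limits do not stick to the calling shell.
(
  ulimit -c 0   # do not leave a core file behind when the writer is killed
  ulimit -f 8   # 8 blocks; bash counts 1024-byte blocks, so 8 KBytes
  head -c 16384 /dev/zero > "$scratch"
) 2>/dev/null
status=$?

# Record how much actually reached the file before the limit was hit.
written=$(wc -c < "$scratch" | tr -d ' ')
rm -f "$scratch"

echo "writer exit status: $status, bytes written: $written"
```

The writer ends with a non-zero status and the file never grows past the limit, which is the same situation the getConfiguration.php child process runs into when the cache file is already larger than the limit.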

See https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki incident report for actionables.

See below comments for debugging / technical details.

Original task

After pushing 1.28.0-wmf.19 to group1 (which includes wikidatawiki) the dispatch seems to have stopped entirely.

https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script

https://www.wikidata.org/wiki/Special:DispatchStats with wikidata on wmf.19 had:

Change log statistics

	ID	Timestamp
Oldest	373786394	19:45, 12 September 2016
Newest	374498192	19:48, 15 September 2016

Dispatch statistics

	Site	Position	Pending	Lag	Touched
Freshest	eswikibooks	374495008	3,184	30 minutes	19:18, 15 September 2016
Median	skwiki	374494927	3,265	30 minutes	19:17, 15 September 2016
Stalest	fiwiki	374493639	4,553	37 minutes	19:48, 15 September 2016
Average	-	-	3,304	30 minutes	-

I reverted wikidata back to .18 and the dispatch got handled just fine. Stats view:

Change log statistics

	ID	Timestamp
Oldest	373789761	20:00, 12 September 2016
Newest	374500962	20:08, 15 September 2016

Dispatch statistics

	Site	Position	Pending	Lag	Touched
Freshest	tywiki	374500941	21	0 minutes	20:08, 15 September 2016
Median	extwiki	374500920	42	0 minutes	20:08, 15 September 2016
Stalest	bnwikisource	374500917	45	0 minutes	20:08, 15 September 2016
Average	-	-	34	0 minutes	-

I have noticed:

 {
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 422,
  "function": "getConfig",
  "class": "SiteConfiguration",
  "type": "->",
  "args": [
    "string",
    "string"
  ]
},
{
  "function": "{closure}",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "boolean",
    "integer",
    "array",
    "NULL"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/libs/objectcache/WANObjectCache.php",
  "line": 987,
  "function": "call_user_func_array",
  "args": [
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/libs/objectcache/WANObjectCache.php",
  "line": 892,
  "function": "doGetWithSetCallback",
  "class": "WANObjectCache",
  "type": "->",
  "args": [
    "string",
    "integer",
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 425,
  "function": "getWithSetCallback",
  "class": "WANObjectCache",
  "type": "->",
  "args": [
    "string",
    "integer",
    "Closure",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 293,
  "function": "getCachedConfigVar",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "string"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 304,
  "function": "getQueueTypes",
  "class": "JobQueueGroup",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php",
  "line": 152,
  "function": "getDefaultQueueTypes",
  "class": "JobQueueGroup",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/includes/Notifications/JobQueueChangeNotificationSender.php",
  "line": 59,
  "function": "push",
  "class": "JobQueueGroup",
  "type": "->",
  "args": [
    "Wikibase\\ChangeNotificationJob"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/includes/ChangeDispatcher.php",
  "line": 248,
  "function": "sendNotification",
  "class": "Wikibase\\Repo\\Notifications\\JobQueueChangeNotificationSender",
  "type": "->",
  "args": [
    "string",
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php",
  "line": 210,
  "function": "dispatchTo",
  "class": "Wikibase\\Repo\\ChangeDispatcher",
  "type": "->",
  "args": [
    "array"
  ]
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/maintenance/doMaintenance.php",
  "line": 110,
  "function": "execute",
  "class": "Wikibase\\DispatchChanges",
  "type": "->",
  "args": []
},
{
  "file": "/srv/mediawiki/php-1.28.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php",
  "line": 271,
  "args": [
    "string"
  ],
  "function": "require_once"
},
{
  "file": "/srv/mediawiki/multiversion/MWScript.php",
  "line": 97,
  "args": [
    "string"
  ],
  "function": "require_once"
}

Details

Subject	Repo	Branch	Lines +/-
Do not limit filesize when running a maintenance script	mediawiki/core	master	+1 -1
Inline doc for $wgMaxShell*	operations/mediawiki-config	master	+5 -3
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()	mediawiki/core	master	+3 -1
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()	mediawiki/core	wmf/1.28.0-wmf.19	+3 -1
All wikis back to 1.28.0-wmf.18	operations/mediawiki-config	master	+895 -895
Fix wikidata to .18 (previous was testwikidata)	operations/mediawiki-config	master	+1 -1
All wiki but wikidatawiki to php-1.28.0-wmf.19	operations/mediawiki-config	master	+300 -300
wikidatawiki back to 1.28.0-wmf.18	operations/mediawiki-config	master	+1 -1

Related Objects

Status	Assigned	Task
Resolved	Legoktm	T75901 Drop PHP 5.3 support
Declined	• demon	T91590 [Spike] Try out hack (<?hh) for mediawiki-config
Resolved	Joe	T104147 can we get rid of rsvg security patch?
Resolved	Reedy	T94149 Get rid of Zend 5.5 tests for wmf branches
Resolved	None	T86081 Complete the use of HHVM over Zend PHP on the Wikimedia cluster
Resolved	Jdforrester-WMF	T172165 Require either PHP 7.0+ or HHVM in MW 1.31
Resolved	None	T190909 php5 is missing on deploy1001 which breaks foreachwiki & l10nupdate
Resolved	fgiunchedi	T146285 Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7)
Declined	thcipriani	T143328 MW-1.28.0-wmf.19 deployment blockers
Resolved	tstarling	T145819 Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes
Resolved	tstarling	T111441 SiteConfiguration::getConfig() does not work in Wikimedia production

Event Timeline

In T145819#2642193, @Addshore wrote:

2016-09-15 19:59:54 [139a37a9f227e562aefeee8d] terbium wikidatawiki 1.28.0-wmf.19 exception ERROR: [139a37a9f227e562aefeee8d] [no req]   MWException from line 561 of /srv/mediawiki/php-1.28.0-wmf.19/includes/SiteConfiguration.php: Failed to run getConfiguration.php. {"exception_id":"139a37a9f227e562aefeee8d"}

Maybe T145839 is related:

[V9vCrwpEFhUAADfGIToAAAAC] /wiki/Special:CreateAccount MWException from line 561 of /srv/mediawiki/php-master/includes/SiteConfiguration.php: Failed to run getConfiguration.php.
...

And strace is again my friend. Running it with:

-y to show file descriptor names
-e trace=desc for system calls related to file descriptor
-s 2048 for large strings

[pid 25957] open("/var/cache/hhvm/cli.hhbc.sq3", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 7
[pid 25957] open("/var/cache/hhvm/cli.hhbc.sq3-journal", O_RDWR|O_CLOEXEC) = 8

That is sqlite journaling.

[pid 25957] read(8</var/cache/hhvm/cli.hhbc.sq3-journal>, ..., 1024 ) = 1024
[pid 25957] lseek(7</var/cache/hhvm/cli.hhbc.sq3>, 844421120, SEEK_SET) = 844421120
[pid 25957] write(7</var/cache/hhvm/cli.hhbc.sq3>, ... , 1024 ) = -1 EFBIG (File too large)
[pid 25957] --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=25957, si_uid=33} ---

[pid 25957] +++ killed by SIGXFSZ (core dumped) +++
[pid 25954] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=25957, si_status=SIGXFSZ, si_utime=3, si_stime=22} ---

Command for copy pasting:
sudo -u www-data strace -y -e trace=desc -s2048 -f /bin/bash /srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki dewikinews --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180'; echo "Exit code: $?"

Without strace but with MW_INCLUDE_STDERR=1

/srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh: line 101: 24089 File size limit exceeded/usr/bin/timeout $MW_WALL_CLOCK_LIMIT /bin/bash -c "$1" 3>&-

TLDR: when running mwscript on terbium, the HHVM cache file /var/cache/hhvm/cli.hhbc.sq3 is committed to. SQLite3 then uses the journaling system to rewrite it, and that explodes due to the ulimit -f.
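The numbers in the strace output line up with the limit. A small worked check, assuming MW_FILE_SIZE_LIMIT is expressed in KBytes (which matches the 512 MBytes figure in the task title):

```shell
limit_kb=524288          # MW_FILE_SIZE_LIMIT from the command above
write_offset=844421120   # lseek() offset from the strace output

# Convert the limit to bytes so it can be compared with the offset.
limit_bytes=$((limit_kb * 1024))
echo "limit:  $limit_bytes bytes ($((limit_bytes / 1048576)) MBytes)"
echo "offset: $write_offset bytes ($((write_offset / 1048576)) MBytes)"

if [ "$write_offset" -ge "$limit_bytes" ]; then
  verdict="over the limit"
else
  verdict="within the limit"
fi
echo "$verdict"
```

The write lands at roughly 805 MBytes into the file, well past the 512 MBytes allowed, hence the EFBIG and the SIGXFSZ.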

hashar edited projects, added HHVM; removed Patch-For-Review. Sep 16 2016, 10:17 AM

hashar merged a task: T145839: Account creation results in fatal MWException. Sep 16 2016, 10:43 AM

hashar added a project: Beta-Cluster-reproducible.

hashar added subscribers: JJMC89, FastLizard4.

SiteConfiguration::getConfig() uses wfShellWikiCmd() to craft the command. It uses the PHP interpreter from $wgPhpCli, that is '/usr/bin/php', which on terbium is HHVM.

549             $retVal = 1;
550             $cmd = wfShellWikiCmd(
551                 "$IP/maintenance/getConfiguration.php",
552                 [
553                     '--wiki', $wiki,
554                     '--settings', implode( ' ', $settings ),
555                     '--format', 'PHP'
556                 ]
557             );

Then the call is made. Note the comment about ulimit5.sh breaking the call !!! The memory limit is set to zero explicitly :(

558             // ulimit5.sh breaks this call
559             $data = trim( wfShellExec( $cmd, $retVal, [], [ 'memory' => 0 ] ) );
560             if ( $retVal != 0 || !strlen( $data ) ) {
561                 throw new MWException( "Failed to run getConfiguration.php." );
562             }
563             $res = unserialize( $data );
564             if ( !is_array( $res ) ) {
565                 throw new MWException( "Failed to unserialize configuration array." );
566             }
567             $this->cfgCache[$wiki] = $this->cfgCache[$wiki] + $res;

A monkey patch would be to pass the limit 'filesize' => 0 to wfShellExec (next to 'memory' => 0) to work around the write to the HHVM cache file.
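The workaround relies on a limit of 0 meaning "no limit", as the existing 'memory' => 0 suggests. The sketch below illustrates that convention with a hypothetical helper, run_with_filesize_limit; it is not the real limit.sh, and the scratch file and the 8-block limit are placeholders.

```shell
# run_with_filesize_limit LIMIT_KB COMMAND...
# Hypothetical helper: a limit of 0 means "do not call ulimit -f at all".
run_with_filesize_limit() {
  limit_kb=$1
  shift
  (
    ulimit -c 0                 # no core file if the command is killed
    if [ "$limit_kb" -gt 0 ]; then
      ulimit -f "$limit_kb"     # only applied for a positive limit
    fi
    "$@"
  ) 2>/dev/null
}

scratch=$(mktemp)

# Same 16 KBytes write, once under an 8-block limit and once with limit 0.
run_with_filesize_limit 8 sh -c 'head -c 16384 /dev/zero > "$0"' "$scratch"
limited_status=$?
run_with_filesize_limit 0 sh -c 'head -c 16384 /dev/zero > "$0"' "$scratch"
unlimited_status=$?

rm -f "$scratch"
echo "limit 8: $limited_status, limit 0: $unlimited_status"
```

With the positive limit the write fails, with 0 it goes through. The trade-off is that the child process is then free to grow any file it writes, including the cache.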

Unsolved

Why is /var/cache/hhvm/cli.hhbc.sq3 not updated/written to on 1.28.0-wmf.18, while it is on 1.28.0-wmf.19?

On the Beta-Cluster-Infrastructure

Trying with the file size ulimit at 512 MBytes (MW_FILE_SIZE_LIMIT=524288, in KBytes):

sudo -u www-data /bin/bash /srv/mediawiki/php-master/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki wikidatawiki --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180;'; echo "Exit code: $?"

OK deployment-mediawiki05, Jessie, HHVM 3.12.7+dfsg-1+wmf1
OK deployment-tin, Trusty, HHVM 3.12.1+dfsg-1~wmf2+trusty0

Upgraded HHVM on deployment-tin to 3.12.7+dfsg-1+wmf1~trusty1. It still writes to the journal file and runs fine.

I am trying in production now :(

OK mw1273, Jessie, 3.12.7+dfsg-1+wmf1

BAD terbium, Trusty, 3.12.7+dfsg-1+wmf1~trusty1

Running:

sudo -u www-data /bin/bash /srv/mediawiki/php-1.28.0-wmf.19/includes/limit.sh '/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php --wiki dewikinews --settings wgJobClasses --format PHP' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50;MW_CGROUP=/sys/fs/cgroup/memory/mediawiki/job; MW_MEM_LIMIT=0; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180'; echo "Exit code: $?"

WORKS on tin
FAIL on terbium

On terbium I then deleted /var/cache/hhvm/cli.hhbc.sq3 and the command now passes.

On terbium cli.hhbc.sq3 was 876 MBytes. It is now down to 17 MBytes.

Change 311118 had a related patch set uploaded (by Hashar):
Inline doc for $wgMaxShell*

https://gerrit.wikimedia.org/r/311118

gerritbot added a project: Patch-For-Review. Sep 16 2016, 12:17 PM

Got a Failed to run getConfiguration.php. on mw1215.eqiad.wmnet:

$ ls -hl /var/cache/hhvm/
total 2.5G
-rw-r--r-- 1 www-data www-data 202M Sep 16 12:28 cli.hhbc.sq3
-rw-r--r-- 1 www-data www-data 2.3G Sep 16 04:10 fcgi.hhbc.sq3

That is from an api.php call. If the file limit stands true, cli.hhbc.sq3 is only 202Mbytes so I am not sure why that fails.

Mentioned in SAL (#wikimedia-operations) [2016-09-16T12:40:02Z] <hashar> Going to rollback all Wikis back to 1.28.0-wmf.18 . Despite much investigation, a bunch of jobs are broken due to T145819 which includes Special:CreateAccount :(

Change 311120 had a related patch set uploaded (by Hashar):
All wikis back to 1.28.0-wmf.18

https://gerrit.wikimedia.org/r/311120

Change 311120 merged by jenkins-bot:
All wikis back to 1.28.0-wmf.18

https://gerrit.wikimedia.org/r/311120

Mentioned in SAL (#wikimedia-operations) [2016-09-16T12:50:09Z] <hashar@tin> rebuilt wikiversions.php and synchronized wikiversions files: All wikis back to 1.28.0-wmf.18 :( T145819

From T145839#2643508 (task about Special:CreateAccount being broken):

Sorry, it took me time to notice this bug was about Special:CreateAccount being broken. I thought it was just yet another random job failing due to T145819. So I kept investigating that other task.

Eventually I browsed the dashboard for authentication metrics at https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen . A screenshot of the last 24 hours shows a 100% error rate on account creation:

accountcreation_error.png (446×917 px, 58 KB)

As a result I have reverted all wikis to 1.28.0-wmf.18, which is reflected in the drop of errors at the far right of the graph above.

Sorry, should really have noticed that earlier.

All wikis are now back to 1.28.0-wmf.18. The Wikidata dispatch lag is all fine. Account creation jobs are back as well.

Tobi_WMDE_SW subscribed. Sep 16 2016, 1:14 PM

hashar added a project: Wikimedia-Incident. Sep 16 2016, 1:21 PM

hashar mentioned this in T143328: MW-1.28.0-wmf.19 deployment blockers. Sep 16 2016, 1:26 PM

• MZMcBride subscribed. Sep 16 2016, 1:33 PM

hashar mentioned this in T145596: Renames getting stuck on mediawiki.org (Sept 13, 2016). Sep 16 2016, 1:50 PM

hashar mentioned this in T145851: GlobalRename is broken and throws MWExceptions. Sep 16 2016, 1:52 PM

Ladsgroup awarded a token. Sep 16 2016, 1:55 PM

Addshore awarded a token. Sep 16 2016, 2:19 PM

I am enlarging the scope of this task. It is not just Wikidata but rather a variety of jobs that fail whenever they shell out to mwscript maintenance/getConfiguration.php.

Incident report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki

(Huge thanks to @Addshore for the support/pointers etc)

hashar added a subtask: T111441: SiteConfiguration::getConfig() does not work in Wikimedia production. Sep 16 2016, 2:50 PM

hashar renamed this task from Wikidata at 1.28.0-wmf.19 no more replicate to wikis (replag raise / dispatch stop) to Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes. Sep 16 2016, 2:55 PM

hashar removed a project: Patch-For-Review.

hashar updated the task description.

It looks like rMW6f9a246d25b2: Make JobQueueGroup::push() update the queuesHaveJobs() cache is what made job pushing start triggering T111441. It'll affect any code that pushes a job onto another wiki's job queue. @aaron might know if we could revert that patch until T111441 can be fixed, or if we can adjust the added code to only run if the job is being pushed on the local wiki.

Anomie mentioned this in T145079: Investigate slow transcludedin query. Sep 16 2016, 5:01 PM

Change 311168 had a related patch set uploaded (by Aaron Schulz):
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311168

gerritbot added a project: Patch-For-Review. Sep 16 2016, 6:17 PM

Addshore moved this task from incoming to monitoring on the Wikidata board. Sep 16 2016, 6:17 PM

Change 311172 had a related patch set uploaded (by Aaron Schulz):
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311172

Change 311172 merged by jenkins-bot:
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311172

Change 311168 merged by jenkins-bot:
Avoid triggering SiteConfiguration lookup in JobQueueGroup::push()

https://gerrit.wikimedia.org/r/311168

hashar removed hashar as the assignee of this task. Sep 16 2016, 7:46 PM

ReleaseTaggerBot added projects: MW-1.28-release-notes, MW-1.28-release (WMF-deploy-2016-09-13_(1.28.0-wmf.19)), MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)). Sep 16 2016, 8:00 PM

Change 311118 merged by jenkins-bot:
Inline doc for $wgMaxShell*

https://gerrit.wikimedia.org/r/311118

A view of messages referencing /srv/mediawiki/php-1.28.0-wmf.19/includes/SiteConfiguration.php as referenced by @Addshore previously in https://logstash.wikimedia.org/goto/259821dc32242eb3fde0cd02755685f6

Fields:

server: name of the server. codfw / snapshot hosts have been removed (had 0 errors)
n_err: number of lines in logstash
cli_size: human readable size of /var/cache/hhvm/cli.hhbc.sq3
cli_date: date of the file as reported by ls
fcgi_size/fcgi_date: same but for /var/cache/hhvm/fcgi.hhbc.sq3

I have stared at that table for a while. I can't find anything specific of interest, but maybe someone else will?

Seems like the cgi cache files have been trashed/regenerated on some hosts.

server	n_err	cli_size	cli_date	fcgi_size	fcgi_date
terbium	323	28M	Sep 19 12:10	20M	Aug 31 12:24
mw1017	0	139M	Sep 6 23:42	1.2G	Sep 19 11:12
mw1099	0	112M	Sep 14 23:22	1.2G	Sep 19 11:13
mw1152	0	41M	Aug 31 10:32	23M	Sep 19 11:27
mw1161	106	22M	Sep 19 06:27	1.1G	Sep 19 11:12
mw1162	66	22M	Sep 19 06:27	1.1G	Sep 19 11:32
mw1163	72	21M	Sep 19 06:27	1.1G	Sep 19 11:12
mw1164	92	21M	Sep 19 06:27	1.1G	Sep 19 11:13
mw1165	87	33M	Sep 19 06:27	1.1G	Sep 19 11:13
mw1166	89	21M	Sep 19 06:27	1.1G	Sep 19 11:12
mw1167	107	22M	Sep 19 06:27	1.1G	Sep 19 11:12
mw1168	76	21M	Sep 19 06:27	1.1G	Sep 19 11:12
mw1169	77	22M	Sep 19 06:27	1.1G	Sep 19 11:13
mw1170	184	206M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1171	170	277M	Sep 19 12:39	6.7G	Sep 19 11:12
mw1172	157	208M	Sep 19 12:39	2.3G	Sep 19 12:22
mw1173	176	210M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1174	162	284M	Sep 19 12:39	6.6G	Sep 19 11:12
mw1175	187	195M	Sep 19 12:38	2.3G	Sep 19 11:12
mw1176	174	193M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1177	168	202M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1178	189	204M	Sep 19 12:39	2.3G	Sep 19 11:36
mw1179	144	199M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1180	174	206M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1181	152	206M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1182	184	202M	Sep 19 12:39	2.3G	Sep 19 11:31
mw1183	180	202M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1184	149	209M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1185	122	212M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1186	174	206M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1187	181	209M	Sep 19 12:39	2.3G	Sep 19 11:57
mw1188	162	202M	Sep 19 12:39	2.3G	Sep 19 11:53
mw1189	26	0	Sep 16 12:33	84M	Sep 19 11:13
mw1190	14	17M	Sep 19 10:48	93M	Sep 19 12:17
mw1191	37	0	Sep 19 10:15	30M	Sep 19 12:37
mw1192	25	21M	Sep 17 16:39	1.2G	Sep 19 11:13
mw1193	17	21M	Sep 17 16:42	1.2G	Sep 19 11:13
mw1194	34	62M	Sep 17 16:41	4.6G	Sep 19 12:36
mw1195	46	21M	Sep 17 16:43	1.2G	Sep 19 11:13
mw1196	29	83M	Sep 17 16:38	2.1G	Sep 19 11:12
mw1197	33	73M	Sep 17 16:44	4.5G	Sep 19 11:12
mw1198	23	21M	Sep 17 16:44	1.2G	Sep 19 11:12
mw1199	28	74M	Sep 17 16:39	4.6G	Sep 19 11:12
mw1200	36	33M	Sep 18 22:24	1.2G	Sep 19 12:38
mw1201	47	33M	Sep 17 16:41	1.2G	Sep 19 12:22
mw1202	32	62M	Sep 17 16:43	4.4G	Sep 19 11:12
mw1203	53	21M	Sep 16 12:09	1.2G	Sep 19 11:12
mw1204	46	21M	Sep 16 11:59	1.2G	Sep 19 11:12
mw1205	29	21M	Sep 16 12:42	1.2G	Sep 19 11:13
mw1206	38	33M	Sep 17 16:50	1.2G	Sep 19 11:12
mw1207	24	33M	Sep 19 10:49	1.2G	Sep 19 11:12
mw1208	20	33M	Sep 17 16:50	1.2G	Sep 19 11:13
mw1209	184	205M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1210	162	206M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1211	182	202M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1212	174	203M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1213	191	209M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1214	146	208M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1215	186	202M	Sep 19 12:39	2.3G	Sep 19 12:25
mw1216	202	204M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1218	166	205M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1219	170	206M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1220	162	188M	Sep 19 12:39	2.2G	Sep 19 11:12
mw1221	28	21M	Sep 17 16:51	1.2G	Sep 19 11:13
mw1222	35	21M	Sep 17 16:52	1.2G	Sep 19 11:12
mw1223	27	21M	Sep 17 16:47	1.2G	Sep 19 11:12
mw1224	31	17M	Sep 17 16:53	1.1G	Sep 19 11:12
mw1225	34	33M	Sep 19 10:48	1.2G	Sep 19 11:12
mw1226	31	17M	Sep 17 16:53	1.1G	Sep 19 11:12
mw1227	23	21M	Sep 17 16:48	1.2G	Sep 19 12:35
mw1228	36	33M	Sep 17 16:47	1.2G	Sep 19 11:13
mw1229	23	21M	Sep 17 16:52	1.2G	Sep 19 11:24
mw1230	25	17M	Sep 17 16:59	1.1G	Sep 19 11:13
mw1231	35	29M	Sep 18 19:05	1.1G	Sep 19 12:07
mw1232	42	17M	Sep 17 17:03	1.1G	Sep 19 11:27
mw1233	36	17M	Sep 17 17:04	1.1G	Sep 19 11:12
mw1234	28	42M	Sep 19 10:47	1.1G	Sep 19 11:12
mw1235	37	29M	Sep 17 17:02	1.1G	Sep 19 11:12
mw1236	173	191M	Sep 19 12:39	2.2G	Sep 19 11:13
mw1237	170	284M	Sep 19 12:39	6.6G	Sep 19 11:12
mw1238	156	274M	Sep 19 12:39	6.6G	Sep 19 11:12
mw1239	212	183M	Sep 19 12:39	2.2G	Sep 19 11:13
mw1240	196	203M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1241	183	287M	Sep 19 12:39	6.8G	Sep 19 11:13
mw1242	170	198M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1243	186	203M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1244	168	204M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1245	172	209M	Sep 19 12:39	2.3G	Sep 19 11:12
mw1246	153	290M	Sep 19 12:39	6.7G	Sep 19 11:12
mw1247	194	200M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1248	186	216M	Sep 19 12:39	2.3G	Sep 19 11:13
mw1249	187	17M	Sep 19 12:39	78M	Sep 19 12:38
mw1250	148	17M	Sep 19 12:39	78M	Sep 19 12:39
mw1251	184	23M	Sep 19 12:39	106M	Sep 19 12:34
mw1252	138	24M	Sep 19 12:39	104M	Sep 19 11:37
mw1253	154	24M	Sep 19 12:39	105M	Sep 19 12:17
mw1254	145	23M	Sep 19 12:39	104M	Sep 19 11:13
mw1255	178	17M	Sep 19 12:39	81M	Sep 19 12:39
mw1256	222	17M	Sep 19 12:39	82M	Sep 19 12:38
mw1257	164	17M	Sep 19 12:39	82M	Sep 19 12:38
mw1258	164	17M	Sep 19 12:38	75M	Sep 19 12:39
mw1259	0	21M	Sep 19 06:34	526M	Sep 19 12:10
mw1260	0	21M	Sep 19 06:54	531M	Sep 19 12:10
mw1261	286	173M	Sep 19 12:39	1.2G	Sep 19 11:12
mw1262	292	170M	Sep 19 12:39	1.2G	Sep 19 11:12
mw1263	248	182M	Sep 19 12:39	1.2G	Sep 19 11:12
mw1264	262	174M	Sep 19 12:39	1.2G	Sep 19 11:12
mw1265	266	181M	Sep 19 12:39	1.2G	Sep 19 11:12
mw1266	0	116M	Aug 29 09:44	1020M	Sep 19 11:13
mw1267	274	160M	Sep 19 12:39	1.1G	Sep 19 12:19
mw1268	262	159M	Sep 19 12:39	1.1G	Sep 19 11:12
mw1269	282	162M	Sep 19 12:39	1.1G	Sep 19 11:13
mw1270	262	156M	Sep 19 12:39	1.1G	Sep 19 12:20
mw1271	281	157M	Sep 19 12:39	1.1G	Sep 19 11:12
mw1272	250	137M	Sep 19 12:39	958M	Sep 19 11:12
mw1273	298	141M	Sep 19 12:39	951M	Sep 19 11:12
mw1274	262	135M	Sep 19 12:39	927M	Sep 19 11:13
mw1275	240	145M	Sep 19 12:39	936M	Sep 19 11:13
mw1276	41	17M	Sep 19 10:48	602M	Sep 19 11:12
mw1277	35	4.6M	Sep 16 12:49	602M	Sep 19 11:33
mw1278	45	4.4M	Sep 16 12:08	602M	Sep 19 11:12
mw1279	45	4.5M	Sep 16 11:55	587M	Sep 19 12:29
mw1280	47	21M	Sep 16 12:48	581M	Sep 19 11:12
mw1281	41	4.5M	Sep 16 12:09	582M	Sep 19 11:18
mw1282	48	4.5M	Sep 16 12:25	586M	Sep 19 11:12
mw1283	39	33M	Sep 19 10:48	582M	Sep 19 11:13
mw1284	33	18M	Sep 18 20:11	580M	Sep 19 11:13
mw1285	51	4.6M	Sep 16 12:29	580M	Sep 19 11:13
mw1286	42	4.4M	Sep 16 12:12	580M	Sep 19 11:13
mw1287	44	4.5M	Sep 16 12:17	577M	Sep 19 11:13
mw1288	42	17M	Sep 19 10:48	580M	Sep 19 11:12
mw1289	46	4.5M	Sep 16 12:31	576M	Sep 19 11:13
mw1290	57	17M	Sep 19 10:48	501M	Sep 19 11:12
mw1293	0	0	Jun 27 14:32	360M	Sep 19 11:12
mw1294	0	0	Jun 27 14:20	362M	Sep 19 12:14
mw1295	0	0	Jun 27 15:58	360M	Sep 19 11:12
mw1296	0	0	Jun 27 15:46	361M	Sep 19 11:12
mw1297	0	0	Jun 28 09:39	354M	Sep 19 11:13
mw1298	0	0	Jun 28 10:04	349M	Sep 19 11:12
mw1299	76	8.8M	Sep 19 06:27	474M	Sep 19 11:13
mw1300	97	8.8M	Sep 19 06:26	446M	Sep 19 11:12
mw1301	83	8.8M	Sep 19 06:27	446M	Sep 19 11:12
mw1302	97	8.8M	Sep 19 06:26	411M	Sep 19 11:12
mw1303	79	8.8M	Sep 19 06:27	446M	Sep 19 11:33
mw1304	72	8.8M	Sep 19 06:26	449M	Sep 19 11:12
mw1305	116	8.8M	Sep 19 06:27	410M	Sep 19 11:12
mw1306	82	8.8M	Sep 19 06:26	412M	Sep 19 11:12

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Sep 19 2016, 4:23 PM

There seems to be a larger underlying issue (captured in T111441).

Since @aaron and @Anomie merged and back-ported https://gerrit.wikimedia.org/r/311172, is that sufficient to try wmf.19 again this Tuesday? Or should T111441 block the train until it is resolved?

It should be noted, as it may factor in to the decision, that there will be no train next week (2016-09-26) since ops will be at an offsite.

thcipriani raised the priority of this task from Medium to High. Sep 19 2016, 5:59 PM

The backport should get the maintenance script call rate back to the old status quo (rare), so re-deploy is worth attempting.

T111441 is most probably a beast too large to get rid of in a short time, and I don't even know who could be diverted to work on it.

I would at least clear out the HHVM bytecode caches to be safe. A server that failed had a 202 MBytes file:

In T145819#2643463, @hashar wrote:
Got a Failed to run getConfiguration.php. on mw1215.eqiad.wmnet:
$ ls -hl /var/cache/hhvm/
total 2.5G
-rw-r--r-- 1 www-data www-data 202M Sep 16 12:28 cli.hhbc.sq3
-rw-r--r-- 1 www-data www-data 2.3G Sep 16 04:10 fcgi.hhbc.sq3
That is from an api.php call. If the file limit stands true, cli.hhbc.sq3 is only 202Mbytes so I am not sure why that fails.

The table above is for today and shows that a lot of servers are around that size if not bigger, thus I highly suspect that will trigger the ulimit again.

Ideally, it would be nice to figure out why the bytecode cache has to be written to. My assumption is that we should get it compiled once on each deploy, and mwscript would not have to mess with it.

When proceeding with the deployment, I highly recommend doing the jobrunners one at a time while watching logstash. Note that /var/log/mediawiki/jobrunner.log is only readable by root for now (due to T146040).

In T145819#2649573, @aaron wrote:

The backport should get the maintenance script call rate back to the old status quo (rare), so re-deploy is worth attempting.

Trying the redeploy now.

wmf.19 is on group0 wikis currently.

Plan is to move group1 at 21:45, and all wikis at 22:00 if there is nothing suspicious in the error logs and no reports of any breakage.

We're back to wmf.18 now.

In T145819#2650834, @greg wrote:

We're back to wmf.18 now.

The explanation is on the wmf.19 blockers task: T143328#2651001. The rollback was not due to the problems outlined in this task, although I was not able to confirm that this task was resolved.

T145839: Account creation results in fatal MWException was duplicated into this task, and this is still open, so I am confused. Account creation is no longer broken, right?

In T145819#2663000, @matmarex wrote:

T145839: Account creation results in fatal MWException was duplicated into this task, and this is still open, so I am confused. Account creation is no longer broken, right?

Account creation is no longer broken, correct.

What are the next steps here, @thcipriani (on duty train conductor, even though we don't have a train this week), @hashar (was on duty for this issue), and @aaron (person who wrote some patches to address at least parts of this)?

It should be a goal to replace usage of the maintenance config script, but not high priority IMO.

greg lowered the priority of this task from High to Medium. Sep 27 2016, 9:22 PM

greg removed projects: Patch-For-Review, Release, Release-Engineering-Team (Deployment-Blockers).

matmarex unsubscribed. Sep 28 2016, 7:15 PM

Might be surfacing again from CirrusSearch related web requests (that shell out to mwscript getConfig): T161520

Dealing with the HHVM byte code cache is being forked as sub task T161598

• MoritzMuehlenhoff removed projects: MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)), HHVM. Apr 7 2017, 7:22 AM

hashar mentioned this in T171903: mw1209 /usr/bin/timeout: the monitored command dumped core. Jul 28 2017, 4:26 PM

That is still surfacing here and there from time to time.

Maybe we can have Scap prime the CLI bytecode cache ( /var/cache/hhvm/cli.hhbc.sq3 ). This way, when HHVM is run in CLI mode, it would not trigger the cache update and thus would not reach the ulimit.

Another way is to have Scap delete /var/cache/hhvm/cli.hhbc.sq3 on deployment?
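The second option could look roughly like the sketch below. It is a hypothetical deploy step, not an existing Scap feature: the function name prune_cli_cache and the size threshold are assumptions, and it presumes no HHVM process is writing to the file at that moment.

```shell
# prune_cli_cache CACHE_FILE MAX_KB
# Hypothetical deploy step: remove the cache file when it is larger than
# MAX_KB KBytes, so the next CLI run starts from a small file.
prune_cli_cache() {
  cache_file=$1
  max_kb=$2
  [ -f "$cache_file" ] || return 0     # nothing to do without a cache file
  # File size rounded up to whole KBytes.
  size_kb=$(( ($(wc -c < "$cache_file") + 1023) / 1024 ))
  if [ "$size_kb" -gt "$max_kb" ]; then
    rm -f -- "$cache_file"
    echo "removed $cache_file (${size_kb} KBytes)"
  fi
}

# Demo on a scratch file: 4 KBytes is above the 2 KBytes threshold.
demo=$(mktemp)
head -c 4096 /dev/zero > "$demo"
prune_cli_cache "$demo" 2
```

On a real host the arguments would be /var/cache/hhvm/cli.hhbc.sq3 and a threshold at or below the 524288 KBytes file size limit. Deleting the file costs a recompilation on the next CLI run, which is what happened on terbium when the file went from 876 MBytes down to 17 MBytes.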

Krinkle removed a project: MW-1.28-release (WMF-deploy-2016-09-13_(1.28.0-wmf.19)). Oct 6 2017, 9:45 PM

Krinkle updated the task description.

greg added a project: Scap. Oct 30 2017, 4:48 PM

greg moved this task from INBOX to Backlog on the Release-Engineering-Team board.

greg edited projects, added Release-Engineering-Team (Backlog); removed Release-Engineering-Team.

hashar mentioned this in T146285: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7). Nov 1 2017, 11:38 PM

Change 391170 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Do not limit filesize when running a maintenance script

https://gerrit.wikimedia.org/r/391170

gerritbot added a project: Patch-For-Review. Nov 14 2017, 9:46 AM

tstarling mentioned this in T172165: Require either PHP 7.0+ or HHVM in MW 1.31. Nov 14 2017, 10:38 AM

Change 391170 merged by jenkins-bot:
[mediawiki/core@master] Do not limit filesize when running a maintenance script

https://gerrit.wikimedia.org/r/391170

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)). Nov 14 2017, 5:01 PM

Krinkle added a parent task: T146285: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7). Nov 15 2017, 7:20 AM