
MW-1.28.0-wmf.7 deployment blockers
Closed, Resolved (Public)

Event Timeline

hashar subscribed.

Per Releng meeting, I am conducting that one.

Mentioned in SAL [2016-06-21T12:30:39Z] <hashar> T136973 started cut of branch wmf/1.28.0-wmf.7

Due to a conflict with personal duties, I can't conduct the train. Since I was sick yesterday, we already had Mukunda as backup for the branch cut and Tyler for the actual deployment. We agreed I would cut the branch (it is in progress) and Tyler confirmed he would be able to handle the group0/group1 switches.

Mentioned in SAL [2016-06-21T13:15:09Z] <hashar> T136973 applied all security patches to 1.28.0-wmf.7

Change 295339 had a related patch set uploaded (by Hashar):
Group0 to 1.28.0-wmf.7

https://gerrit.wikimedia.org/r/295339

Mentioned in SAL [2016-06-21T13:48:51Z] <hashar@tin> Started scap: testwiki to 1.28.0-wmf.7 T136973

Mentioned in SAL [2016-06-21T13:53:09Z] <hashar@tin> scap aborted: testwiki to 1.28.0-wmf.7 T136973 (duration: 04m 17s)

Mentioned in SAL [2016-06-21T13:53:45Z] <hashar@tin> Started scap: testwiki to 1.28.0-wmf.7 (take two) T136973

Mentioned in SAL [2016-06-21T13:55:20Z] <hashar@tin> scap aborted: testwiki to 1.28.0-wmf.7 (take two) T136973 (duration: 01m 35s)

Mentioned in SAL [2016-06-21T13:55:35Z] <hashar@tin> Started scap: testwiki to 1.28.0-wmf.7 (take three) T136973

scap to testwiki fails though:

14:06:23 Started scap: (no message)
14:06:47 Copying to tin.eqiad.wmnet from deployment.eqiad.wmnet
14:06:47 Started rsync common
14:08:43 Finished rsync common (duration: 01m 55s)
14:08:44 Started l10n-update
14:08:44 Updating ExtensionMessages-1.28.0-wmf.6.php
14:08:45 Updating LocalisationCache for 1.28.0-wmf.6 using 4 thread(s)
14:09:20 Generating JSON versions and md5 files
14:09:21 Bootstrapping l10n cache for 1.28.0-wmf.7
14:09:22 Last output:
Warning: require_once(/etc/mediawiki/WikitechPrivateSettings.php): failed to open stream: No such file or directory in /srv/mediawiki-staging/wmf-config/wikitech.php on line 183
Fatal error: require_once(): Failed opening required '/etc/mediawiki/WikitechPrivateSettings.php' (include_path='/srv/mediawiki-staging/php-1.28.0-wmf.7:/usr/local/lib/php:/usr/share/php') in /srv/mediawiki-staging/wmf-config/wikitech.php on line 183
14:09:22 Finished l10n-update (duration: 00m 37s)
14:09:22 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 242, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 304, in main
    return super(Scap, self).main(*extra_args)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 46, in main
    self._before_cluster_sync()
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 326, in _before_cluster_sync
    version, wikidb, self.verbose, self.config)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 303, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 525, in update_localization_cache
    lang='en', quiet=True)
  File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in _call_rebuildLocalisationCache
    'quiet': '--quiet' if quiet else ''
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 303, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 400, in sudo_check_call
    raise subprocess.CalledProcessError(proc.returncode, cmd)
CalledProcessError: Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_87423667" --threads=4 --lang en  --quiet' returned non-zero exit status 255
14:09:22 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_87423667" --threads=4 --lang en  --quiet' returned non-zero exit status 255 (duration: 02m 58s)

In other words, rebuildLocalisationCache.php --wiki="labtestwiki" fails because /etc/mediawiki/WikitechPrivateSettings.php is not present on tin.
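The fatal error in the log names the missing file directly. As a minimal sketch (the helper name `missing_required_files` is hypothetical and not part of scap), the paths could be pulled out of captured PHP output like this:

```python
import re

# Matches the path quoted in PHP's "Failed opening required '<path>'" fatal error.
_FAILED_REQUIRE = re.compile(r"Failed opening required '([^']+)'")


def missing_required_files(php_output):
    """Return the paths PHP could not require, in order, without duplicates."""
    seen = []
    for path in _FAILED_REQUIRE.findall(php_output):
        if path not in seen:
            seen.append(path)
    return seen


# Sample text adapted from the log above (include_path shortened).
sample = (
    "Fatal error: require_once(): Failed opening required "
    "'/etc/mediawiki/WikitechPrivateSettings.php' "
    "(include_path='/srv/mediawiki-staging/php-1.28.0-wmf.7') "
    "in /srv/mediawiki-staging/wmf-config/wikitech.php on line 183"
)
print(missing_required_files(sample))
```

This only reads the error text; it does not check whether the file exists on a given host.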

Wrong testwiki:

-    "labtestwiki": "php-1.28.0-wmf.6",
+    "labtestwiki": "php-1.28.0-wmf.7",
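The diff above switches `labtestwiki` rather than `testwiki`. A minimal sketch of a sanity check for that kind of mistake, assuming the versions file is a flat JSON object mapping wiki database names to `php-<version>` directories, as the diff suggests. The function name `changed_wikis` and the sample `testwiki` entries are illustrative and not taken from the deployment tooling:

```python
import json


def changed_wikis(old_json, new_json):
    """Return {wiki: (old_version, new_version)} for entries that differ."""
    old = json.loads(old_json)
    new = json.loads(new_json)
    return {
        wiki: (old.get(wiki), new.get(wiki))
        for wiki in sorted(set(old) | set(new))
        if old.get(wiki) != new.get(wiki)
    }


# Hypothetical before/after contents; only the labtestwiki lines come from the diff.
before = '{"labtestwiki": "php-1.28.0-wmf.6", "testwiki": "php-1.28.0-wmf.6"}'
after = '{"labtestwiki": "php-1.28.0-wmf.7", "testwiki": "php-1.28.0-wmf.6"}'

diff = changed_wikis(before, after)
# Anything that moved but was not meant to move is flagged.
intended = {"testwiki"}
unexpected = set(diff) - intended
print(diff, unexpected)
```

Comparing the set of changed wikis against the intended set before syncing would surface a wrong entry like this one early.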

Deploying wmf.7 to group2 wikis likely caused a pretty big regression in save timing (https://grafana.wikimedia.org/dashboard/db/save-timing). Back to wmf.7 on group1 only for the time being.

I would blame stashEdit. The rate of API POST requests went from 22-25k per minute to 40k.

On https://grafana.wikimedia.org/dashboard/db/api-requests a list lets you select the API module to filter on (edit or stashedit), and the graph at the bottom shows the distribution of response times by percentile.

The edit module is barely impacted. The stashedit 75th percentile roughly doubled, from ~700 ms to 1.3 seconds.
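For reference, a small sketch of the arithmetic behind those figures: a nearest-rank percentile over raw timings, and the ratio between two readings. The function names are hypothetical, the sample timings are made up for illustration, and the ~700 reading is assumed to be in milliseconds. The Grafana dashboards may well compute percentiles differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ratio(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


# Made-up response times in milliseconds, for illustration only.
timings_ms = [300, 450, 500, 700, 900, 1300, 2000, 650]
print(percentile(timings_ms, 75))

# Readings quoted in the thread: stashedit p75 and the API POST rate.
print(round(ratio(700, 1300), 2))
print(round(ratio(25_000, 40_000), 2))
```

With the quoted readings, the stashedit 75th percentile grew by a bit under 2x and the POST rate by about 1.6x over the upper end of its previous range.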

My intuition before I actually sleep is that the save-timing board takes stashedit into account, which regressed, and I don't think that one has a user-facing effect. API calls to edit show a flat line.

Rollback is probably the safest yeah :-} We would want a new blocking task and figure out who knows about stashEdit.

wmf.7 needs to be fully rolled back due to the CSS/skin loading issues.

Current state is:

The stylesheet issue that was discovered overnight is solved (T138586).

Wiki versions:

group0: 1.28.0-wmf.7
group1: 1.28.0-wmf.7
rest: 1.28.0-wmf.6

The train is now blocked on the save time regression, T138550: 1.28.0-wmf.7 save time regression. We are going to leave it as is over the weekend so people can attempt to figure out the root cause.

If we get a fix available on Monday we will push 1.28.0-wmf.7 on all wikis and then resume the usual train with 1.28.0-wmf.8 cut on Tuesday.

Else, we will most probably freeze the train and postpone the next branch for a week.

RelengTeam is having its weekly meeting on Monday at 4pm UTC and we will definitely talk about this / take a decision.

I have posted the above status update to both wikitech-l and engineering lists.

Mentioned in SAL [2016-06-28T20:09:29Z] <twentyafterfour> deploying https://gerrit.wikimedia.org/r/#/c/296440/ to hopefully unblock wmf.7 deployments. refs T138550, T136973

Mentioned in SAL [2016-06-28T20:09:52Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploying https://gerrit.wikimedia.org/r/#/c/296440/ refs T138550, T136973 (duration: 02m 06s)

Mentioned in SAL [2016-06-28T20:24:28Z] <twentyafterfour@tin> rebuilt wikiversions.php and synchronized wikiversions files: once again rolling back to wmf.6 refs T136973 T138550

Mentioned in SAL [2016-06-28T21:24:51Z] <twentyafterfour> deploying wmf.7 yet again, once CI finishes testing https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973

Mentioned in SAL [2016-06-28T21:31:47Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploy https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973 (duration: 00m 36s)

Change 295339 abandoned by Jforrester:
Group0 to 1.28.0-wmf.7

Reason:
Didn't get used.

https://gerrit.wikimedia.org/r/295339