Page MenuHomePhabricator

1.28.0-wmf.7 save time regression
Closed, ResolvedPublic

Description

After the rollout of wmf.7 to all wikis on 2016-06-23 19:30 backend save time encountered a noticeable regression:

Over last 30 minutes: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1467142260.174&target=MediaWiki.timing.editResponseTime.p75&from=-30minutes

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 23 2016, 10:23 PM
thcipriani triaged this task as High priority.Jun 23 2016, 10:23 PM

Moving this comment here so it isn't lost

I would blame stashEdit. The rate of api POST went from 22-25k / minutes to 40k.
On https://grafana.wikimedia.org/dashboard/db/api-requests a list let you select the API module to filter on (edit or stashedit) and the graph at the bottom shows the distribution of times per percentiles.
The edit module is barely impacted. The stashedit 75p doubled from ~700 to 1.3 seconds.
My intuition before I actually sleep is that the save-timing board takes in account the stashedit which regressed and I dont think that one as an user effect. API calls to edit show a flat line.
Rollback is probably the safest yeah :-} We would want a new blocking task and figure out who knows about stashEdit.

thcipriani updated the task description. (Show Details)Jun 23 2016, 10:29 PM

May be a red-herring, but the AbuseFilter tokenizercache hit count has a fairly strong correlation to the deploys and there were a fair amount of log errors from AbuseFilter being thrown:

hashar added a subscriber: aaron.Jun 24 2016, 9:11 AM

There are hints that AbuseFilter might be to blame. The graph below shows a huge spike of calls to AbuseFilter::filterAction() as reported via MediaWiki.abusefilter.check_stash.*.count

  • Blue: hits
  • Green: store
  • Red: miss
  • Magenta (left axis): tokenizerCache hits

MediaWiki hooks counts are apparently no more reported to statsd (they used to be under MediaWiki.hooks.*. Would have been nice to have an idea of how many time the hook AbuseFilter-filterAction is invoked.


The suspicious change in AbuseFilter would be rEABFe91939fb3fdc: Cache AbuseFilter::checkAllFilters during edit stashing for T137698: Plugin AbuseFilter::checkAllFilters pre-caching into the edit stash hook

Also apparently depends on AntiSpoof change https://gerrit.wikimedia.org/r/#/c/294284/

There is a surge of POST requests to API, a view for all modules:

Same but solely for the API stashedit module:

Distribution of timings for stashedit requests:

So maybe the issue is not so much in AbuseFilter as in some frontend app (VisualEditor maybe?) suddenly causing a huge spike of API stashedit calls. Maybe in VisualEditor when ever you press a key in the edit field that ends up doing a stashedit background query?

@ori made some edit stashing changes in the wikitext editor, which does stashing in the middle of editing. I think visual editor only stashes when the user clicks the first "Save" button and begins typing the edit summary.

This is blocking remaining deployment of 1.28.0-wmf.7 ( T136973 ). If we got a proper solution on Monday we will push it, else we will pause deployment until it is figured out.

Restricted Application added a subscriber: Luke081515. · View Herald TranscriptJun 24 2016, 8:21 PM
mmodell claimed this task.Jun 27 2016, 4:15 PM
mmodell added a subscriber: mmodell.

@ori, @matmarex: are there any further clues about this?

mmodell removed mmodell as the assignee of this task.Jun 27 2016, 4:18 PM
greg raised the priority of this task from High to Unbreak Now!.Jun 27 2016, 5:43 PM

This should have been UBN! as it's blocking the train. wmf.7 is still not on Wikipedias (it was rolled back to just group0+1 to test).

As Antoine said, without clarity here we won't cut/deploy a wmf.8 tomorrow/this week.

Restricted Application added subscribers: TerraCodes, Urbanecm. · View Herald TranscriptJun 27 2016, 5:43 PM
ori added a comment.Jun 27 2016, 6:05 PM

@ori made some edit stashing changes in the wikitext editor, which does stashing in the middle of editing. I think visual editor only stashes when the user clicks the first "Save" button and begins typing the edit summary.

That change is in wmf.6 (and wmf.5), so that can't be it.

Change 296255 had a related patch set uploaded (by Aaron Schulz):
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296255

Change 296255 merged by jenkins-bot:
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296255

greg added a comment.Jun 27 2016, 7:22 PM

@aaron with that patch merged are you comfortable with rolling forward? (NB: There might be a logging patch as well from T138585 we'd want to include, but that's not written yet).

aaron added a comment.Jun 27 2016, 7:31 PM

@aaron with that patch merged are you comfortable with rolling forward? (NB: There might be a logging patch as well from T138585 we'd want to include, but that's not written yet).

Sure.

greg assigned this task to aaron.Jun 27 2016, 7:34 PM
hashar updated the task description. (Show Details)Jun 28 2016, 7:31 PM

Mentioned in SAL [2016-06-28T19:32:10Z] <twentyafterfour> Rolling back to wmf.6: T138550 is still a problem

This is now holding up wmf.7 and wmf.8. What do we need to revert in order to unblock deployments?

Change 296440 had a related patch set uploaded (by 20after4):
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296440

Change 296440 merged by jenkins-bot:
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296440

Turns out @aaron patches was only in master and not in wmf.7

Mentioned in SAL [2016-06-28T20:09:29Z] <twentyafterfour> deploying https://gerrit.wikimedia.org/r/#/c/296440/ to hopefully unblock wmf.7 deployments. refs T138550, T136973

Mentioned in SAL [2016-06-28T20:09:52Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploying https://gerrit.wikimedia.org/r/#/c/296440/ refs T138550, T136973 (duration: 02m 06s)

mmodell added a comment.EditedJun 28 2016, 8:15 PM

Still doesn't look good:

both bumps:

hashar added a comment.EditedJun 28 2016, 8:18 PM

Next attempt at 20:10 UTC this time with the AbuseFilter change:

Mentioned in SAL [2016-06-28T20:24:28Z] <twentyafterfour@tin> rebuilt wikiversions.php and synchronized wikiversions files: once again rolling back to wmf.6 refs T136973 T138550

ori added a comment.Jun 28 2016, 8:26 PM

I think the plan should be to backport suspect patches to wmf.6 one by one, so we can isolate it.

With @aaron patch, the rate of stashedit requests hasn't increased (it even went down). So I guess his patch is a good thing to have, or maybe it is unrelated.

The check_stash is sill hit way too much though :(

aaron added a comment.EditedJun 28 2016, 8:34 PM

check_stash will certainly increase since the older branch didn't have the stash so there is no statsd data to send. Deploying enwiki will cause a bump.

The save timing graph is a problem though. I suspect it's the exportAllVars() call in getStashKey() triggering lazy var computation when it should ignore those.

Change 296457 had a related patch set uploaded (by Aaron Schulz):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296457

Change 296464 had a related patch set uploaded (by 20after4):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296464

Mentioned in SAL [2016-06-28T21:24:51Z] <twentyafterfour> deploying wmf.7 yet again, once CI finishes testing https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973

Change 296464 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296464

Mentioned in SAL [2016-06-28T21:31:47Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploy https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973 (duration: 00m 36s)

Change 296476 had a related patch set uploaded (by 20after4):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296476

mmodell closed this task as Resolved.Jun 28 2016, 10:27 PM

Change 296476 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296476

Over last 24 hours https://graphite.wikimedia.org/render/?width=800&height=600&target=MediaWiki.timing.editResponseTime.p75&from=-24hours

Over a month for perspective https://graphite.wikimedia.org/render/?width=800&height=600&target=MediaWiki.timing.editResponseTime.p75&from=-1months

So it is definitely working. Thanks @aaron for the set of patches that solved it for sure and has kept up on track for 1.28.0-wmf.8 deployment this week.

Change 296457 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296457