Page MenuHomePhabricator

1.28.0-wmf.7 save time regression
Closed, ResolvedPublic

Assigned To
Authored By
thcipriani
Jun 23 2016, 10:23 PM
Referenced Files
F4213638: savetimeregression_24hours
Jun 29 2016, 7:18 PM
F4213640: savetimeregression_1month
Jun 29 2016, 7:18 PM
F4210094: rendertime_fixed.png
Jun 28 2016, 9:51 PM
F4209890: take2-time.png
Jun 28 2016, 8:21 PM
F4209883: take2.png
Jun 28 2016, 8:18 PM
F4209880: graphite.wikimedia.org.png
Jun 28 2016, 8:17 PM
F4209870: graphite.wikimedia.org.png
Jun 28 2016, 8:15 PM
F4209758: p75_edit_saving_time
Jun 28 2016, 7:32 PM

Description

After the rollout of wmf.7 to all wikis on 2016-06-23 19:30 backend save time encountered a noticeable regression:

graphite.wikimedia.org.png (308×586 px, 34 KB)

Over last 30 minutes: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1467142260.174&target=MediaWiki.timing.editResponseTime.p75&from=-30minutes

Event Timeline

Moving this comment here so it isn't lost

I would blame stashEdit. The rate of api POST went from 22-25k / minutes to 40k.

On https://grafana.wikimedia.org/dashboard/db/api-requests a list let you select the API module to filter on (edit or stashedit) and the graph at the bottom shows the distribution of times per percentiles.

The edit module is barely impacted. The stashedit 75p doubled from ~700 to 1.3 seconds.

My intuition before I actually sleep is that the save-timing board takes in account the stashedit which regressed and I dont think that one as an user effect. API calls to edit show a flat line.

Rollback is probably the safest yeah :-} We would want a new blocking task and figure out who knows about stashEdit.

May be a red-herring, but the AbuseFilter tokenizercache hit count has a fairly strong correlation to the deploys and there were a fair amount of log errors from AbuseFilter being thrown:

graphite.wikimedia.org.png (308×586 px, 26 KB)

There are hints that AbuseFilter might be to blame. The graph below shows a huge spike of calls to AbuseFilter::filterAction() as reported via MediaWiki.abusefilter.check_stash.*.count

  • Blue: hits
  • Green: store
  • Red: miss
  • Magenta (left axis): tokenizerCache hits

abusefilter-check-stash.png (394×778 px, 76 KB)

MediaWiki hooks counts are apparently no more reported to statsd (they used to be under MediaWiki.hooks.*. Would have been nice to have an idea of how many time the hook AbuseFilter-filterAction is invoked.


The suspicious change in AbuseFilter would be rEABFe91939fb3fdc: Cache AbuseFilter::checkAllFilters during edit stashing for T137698: Plugin AbuseFilter::checkAllFilters pre-caching into the edit stash hook

Also apparently depends on AntiSpoof change https://gerrit.wikimedia.org/r/#/c/294284/

There is a surge of POST requests to API, a view for all modules:

api-POST-requests-rate.png (303×647 px, 50 KB)

Same but solely for the API stashedit module:

api-POST-stashedit-rate.png (303×648 px, 53 KB)

Distribution of timings for stashedit requests:

api-POST-stashedit-timings.png (353×859 px, 88 KB)

So maybe the issue is not so much in AbuseFilter as in some frontend app (VisualEditor maybe?) suddenly causing a huge spike of API stashedit calls. Maybe in VisualEditor when ever you press a key in the edit field that ends up doing a stashedit background query?

@ori made some edit stashing changes in the wikitext editor, which does stashing in the middle of editing. I think visual editor only stashes when the user clicks the first "Save" button and begins typing the edit summary.

This is blocking remaining deployment of 1.28.0-wmf.7 ( T136973 ). If we got a proper solution on Monday we will push it, else we will pause deployment until it is figured out.

mmodell subscribed.

@ori, @matmarex: are there any further clues about this?

greg raised the priority of this task from High to Unbreak Now!.Jun 27 2016, 5:43 PM

This should have been UBN! as it's blocking the train. wmf.7 is still not on Wikipedias (it was rolled back to just group0+1 to test).

As Antoine said, without clarity here we won't cut/deploy a wmf.8 tomorrow/this week.

@ori made some edit stashing changes in the wikitext editor, which does stashing in the middle of editing. I think visual editor only stashes when the user clicks the first "Save" button and begins typing the edit summary.

That change is in wmf.6 (and wmf.5), so that can't be it.

Change 296255 had a related patch set uploaded (by Aaron Schulz):
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296255

Change 296255 merged by jenkins-bot:
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296255

@aaron with that patch merged are you comfortable with rolling forward? (NB: There might be a logging patch as well from T138585 we'd want to include, but that's not written yet).

@aaron with that patch merged are you comfortable with rolling forward? (NB: There might be a logging patch as well from T138585 we'd want to include, but that's not written yet).

Sure.

Mentioned in SAL [2016-06-28T19:32:10Z] <twentyafterfour> Rolling back to wmf.6: T138550 is still a problem

This is now holding up wmf.7 and wmf.8. What do we need to revert in order to unblock deployments?

Change 296440 had a related patch set uploaded (by 20after4):
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296440

Change 296440 merged by jenkins-bot:
Move the filter pre-caching outside of the DB lock

https://gerrit.wikimedia.org/r/296440

Turns out @aaron patches was only in master and not in wmf.7

Mentioned in SAL [2016-06-28T20:09:29Z] <twentyafterfour> deploying https://gerrit.wikimedia.org/r/#/c/296440/ to hopefully unblock wmf.7 deployments. refs T138550, T136973

Mentioned in SAL [2016-06-28T20:09:52Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploying https://gerrit.wikimedia.org/r/#/c/296440/ refs T138550, T136973 (duration: 02m 06s)

Still doesn't look good:

graphite.wikimedia.org.png (308×586 px, 18 KB)

both bumps:

graphite.wikimedia.org.png (492×962 px, 68 KB)

Next attempt at 20:10 UTC this time with the AbuseFilter change:

take2.png (492×962 px, 68 KB)

take2-time.png (308×586 px, 22 KB)

Mentioned in SAL [2016-06-28T20:24:28Z] <twentyafterfour@tin> rebuilt wikiversions.php and synchronized wikiversions files: once again rolling back to wmf.6 refs T136973 T138550

I think the plan should be to backport suspect patches to wmf.6 one by one, so we can isolate it.

With @aaron patch, the rate of stashedit requests hasn't increased (it even went down). So I guess his patch is a good thing to have, or maybe it is unrelated.

The check_stash is sill hit way too much though :(

check_stash will certainly increase since the older branch didn't have the stash so there is no statsd data to send. Deploying enwiki will cause a bump.

The save timing graph is a problem though. I suspect it's the exportAllVars() call in getStashKey() triggering lazy var computation when it should ignore those.

Change 296457 had a related patch set uploaded (by Aaron Schulz):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296457

Change 296464 had a related patch set uploaded (by 20after4):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296464

Mentioned in SAL [2016-06-28T21:24:51Z] <twentyafterfour> deploying wmf.7 yet again, once CI finishes testing https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973

Change 296464 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296464

Mentioned in SAL [2016-06-28T21:31:47Z] <twentyafterfour@tin> Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploy https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973 (duration: 00m 36s)

Change 296476 had a related patch set uploaded (by 20after4):
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296476

Change 296476 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296476

Over last 24 hours https://graphite.wikimedia.org/render/?width=800&height=600&target=MediaWiki.timing.editResponseTime.p75&from=-24hours

savetimeregression_24hours (600×800 px, 64 KB)

Over a month for perspective https://graphite.wikimedia.org/render/?width=800&height=600&target=MediaWiki.timing.editResponseTime.p75&from=-1months

savetimeregression_1month (600×800 px, 71 KB)

So it is definitely working. Thanks @aaron for the set of patches that solved it for sure and has kept up on track for 1.28.0-wmf.8 deployment this week.

Change 296457 merged by jenkins-bot:
Avoid using computed variables to determine stash keys

https://gerrit.wikimedia.org/r/296457