
Regression: Backend Save Timing raised by about 50%
Closed, Declined · Public

Assigned To
None
Authored By
Krinkle
Jun 14 2017, 1:54 PM
Referenced Files
F8467672: Capture d'écran 2017-06-16 15.01.38.png
Jun 16 2017, 1:02 PM
F8467653: graph.png
Jun 16 2017, 1:02 PM
F8467647: Capture d'écran 2017-06-16 14.55.34.png
Jun 16 2017, 1:02 PM
F8467655: Capture d'écran 2017-06-16 14.58.20.png
Jun 16 2017, 1:02 PM
F8459686: Screen Shot 2017-06-14 at 15.34.43.png
Jun 14 2017, 2:40 PM
F8459687: Screen Shot 2017-06-14 at 15.34.57.png
Jun 14 2017, 2:40 PM
F8459468: Screen Shot 2017-06-14 at 15.00.49.png
Jun 14 2017, 2:09 PM
F8459444: Screen Shot 2017-06-14 at 14.51.27.png
Jun 14 2017, 1:54 PM

Description

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1496894179569&to=1497166735726

Screen Shot 2017-06-14 at 14.51.21.png (978×2 px, 250 KB)

Screen Shot 2017-06-14 at 14.51.27.png (988×2 px, 243 KB)

metric    before    after
lower     80ms      80ms
p50       320ms     500ms
p75       470ms     600ms
p95       1.3s      3s
p99       15s       30s

Started around June 8 23:00 or June 9 10:00 (UTC). Probably caused by wmf.4.

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Jun 14 2017, 1:54 PM

This graph is incomplete, it's gone back down significantly now. Was that branch reverted?

> This graph is incomplete, it's gone back down significantly now. Was that branch reverted?

No, it wasn't reverted. But yeah, it went back down Monday morning. Not all the way, but most of it.

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1496894179569&to=1497398400000&refresh=5m&orgId=1

Screen Shot 2017-06-14 at 15.00.49.png (984×2 px, 313 KB)

Server Admin Log entries from when the regression started:

2017-06-08

  • 23:37 demon@tin: rebuilt wikiversions.php and synchronized wikiversions files: remaining wikis to wmf.4

2017-06-09

  • 00:15 mutante: mw1275 depooled (T124956)
  • 00:36 ejegg: disabled banner impressions loader
  • 02:19 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.4) (duration: 06m 04s)
  • 02:25 l10nupdate@tin: ResourceLoader cache refresh completed at Fri Jun 9 02:25:29 UTC 2017 (duration 6m 27s)
  • 05:51 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1074 - T166205 (duration: 00m 42s)
  • 05:57 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1056 - T166206 (duration: 00m 41s)
  • 06:41 moritzm: updating mw1161 to HHVM 3.18
  • 06:58 moritzm: updating mw117* to HHVM 3.18+wmf5
  • 07:40 moritzm: upgrade app servers in codfw running HHVM 3.18 to +wmf5
  • 19:46 RainbowSprinkles: mw1299: running scap pull, maybe out of date?
  • 20:12 demon@tin: Synchronized php-1.30.0-wmf.4/extensions/CirrusSearch/includes/Job/DeleteArchive.php: Really fix it this time (duration: 00m 43s)

And from when it ended:

2017-06-12

  • 06:53 moritzm: upgrade remaining app servers running HHVM 3.18 to 3.18.2+wmf5
  • 07:19 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1064 - T166206 (duration: 00m 41s)
  • 07:31 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1060 - T166205 (duration: 00m 41s)
  • 07:31 marostegui: Deploy alter table s2 - db1060 - T166205
  • 07:38 marostegui: Reboot ms-be1008 as xfs is failing
  • 08:31 godog: reboot ms-be1002, load avg slowly creeping up
  • 09:13 moritzm: upgrading mw1236-mw1249 to HHVM 3.18
  • 09:55 moritzm: upgrading mw1221-mw1235 to HHVM 3.18
  • 10:28 moritzm: upgrading mw1250-mw1258 to HHVM 3.18
  • 11:05 moritzm: upgrading job runners mw1162-mw1164 to HHVM 3.18
  • 12:01 moritzm: upgrading mw1266-mw1275 to HHVM 3.18

There seems to be DB work going on, with those DB servers being repooled and depooled. @jcrespo, any DBA work happening over that period that could explain this?

No, I wasn't even working at those times, and as I had 4 days of holidays, I didn't leave anything running even 24 hours beforehand.

The DB servers that were repooled when it started are not the same ones that were depooled when it ended. It was just regular DB maintenance (alter tables going on).

And the times don't really match any of the start/end spikes, I think.

It doesn't really seem to correlate; the first big spike for Save Timing is 8 hours after the start of the sharp rise on the graph you've just linked to.

That is the only issue I am aware of regarding databases; pools and depools happen every single day, so I do not see why they would affect this, especially since we never touch the masters or the ES servers. In the past, I asked ori whether faster ES servers affected performance, and I was told the main blockers are some software triggers and dependencies, with the database having close to no impact on save timings (it is a very small part of the save process).

There were some HHVM upgrades at the time; maybe you can ask about those, but I would honestly focus on metric-driven investigation.
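
For reference, that kind of metric-driven comparison can be scripted against the Graphite render API behind the dashboard; a minimal sketch, assuming a series name like the placeholder below (the metric path and date ranges are illustrative, not the real ones):

```python
# Minimal sketch: compare the mean of a save-timing percentile series before
# and after the deploy window, using the Graphite render API. The metric path
# and time ranges are placeholders, not the real series names.
import statistics

import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "MediaWiki.timing.saveTiming.p75"  # hypothetical path; adjust to the real series


def mean_over(metric, start, end):
    """Average the non-null datapoints of one series between start and end."""
    resp = requests.get(GRAPHITE, params={
        "target": metric,
        "from": start,    # Graphite absolute dates, e.g. "20170606"
        "until": end,
        "format": "json",
    }, timeout=30)
    resp.raise_for_status()
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    return statistics.mean(points)


before = mean_over(METRIC, "20170606", "20170608")
after = mean_over(METRIC, "20170609", "20170611")
print(f"before: {before:.0f} ms  after: {after:.0f} ms  ratio: {after / before:.2f}")
```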

Assuming each of the HHVM upgrades implicitly clears the APC memory on that server, that could be a potential cause as well (e.g. increased save timing latency due to internal cache misses while the MediaWiki server slowly re-populates its APC keys).
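
A minimal sketch of that mechanism, purely illustrative (the key name and the recompute cost are made up; this is not MediaWiki's actual cache code):

```python
# Illustrative only: with APC wiped by a restart, every lookup falls through
# to the slow recompute path until the cache repopulates, which shows up as
# higher save timing. The key and the 50 ms cost are made up.
import time

apc = {}  # stands in for the per-server APC store, emptied on HHVM restart


def get_with_cache(key, compute):
    if key in apc:           # warm server: cheap hit
        return apc[key]
    value = compute()        # cold server: pay the full recompute cost
    apc[key] = value
    return value


def expensive_lookup():
    time.sleep(0.05)         # pretend this is config/sites/interwiki loading
    return "value"


t0 = time.time()
get_with_cache("example-key", expensive_lookup)   # miss: ~50 ms
print(f"cold: {(time.time() - t0) * 1000:.0f} ms")

t0 = time.time()
get_with_cache("example-key", expensive_lookup)   # hit: ~0 ms
print(f"warm: {(time.time() - t0) * 1000:.0f} ms")
```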

And indeed, there does appear to be a noticeable change in APC memory size between June 9 and June 12. However, instead of a drop (as I'd expected), there is actually a significant increase in APC memory usage.

T158176: Build / migrate to HHVM 3.18

I don't think this is related to HHVM 3.18; about half of the app servers and API servers in eqiad had been running 3.18 since May already, so we'd have noticed the effect much earlier than that. Most of the HHVM updates logged in SAL were from one 3.18 build to the next (since we found quite a few upstream bugs). Updates from 3.12 to 3.18 were logged as "updating to 3.18"; the others usually specify the updated releases.

The change from +wmf4 to +wmf5 is limited to this upstream patch: https://github.com/facebook/hhvm/commit/22e7acdb22b3d14420f82c7f59c843bd4e6cafcd
That code path is triggered quite rarely (about 1-3 times per day on API servers, 3-5 times per day on job runners, and less than once a day on standard app servers) and would not explain any substantial load change.

The incident seems to be over (we'll probably want to wait a bit more before claiming that for certain):

Capture d'écran 2017-06-16 14.55.34.png (537×924 px, 96 KB)

graph.png (415×747 px, 41 KB)

APC usage is still high; we might want to turn that into a separate task:

Capture d'écran 2017-06-16 14.58.20.png (981×1 px, 126 KB)

Although there was a similar spike last fall already:

Capture d'écran 2017-06-16 15.01.38.png (983×1 px, 184 KB)

As far as I can tell this is over and was correlated with general load. We'll reopen if it reoccurs.

> I don't think this is related to HHVM 3.18; about half of the app servers and API servers in eqiad had been running 3.18 since May already

To clarify, I wasn't suggesting the upgraded HHVM runtime would be slower in any way. Rather, I was suggesting the upgrade itself caused a measurable (but, logically, temporary) slowdown due to the APC cache naturally being wiped when HHVM is restarted.

For each upgrade, latency will be higher at first. And when lots of servers are restarted (upgraded) at the same time, this affects more users, and thus our metrics. This is only natural and unavoidable, but it does bring up two points:

  • A server should probably be warmed up using local requests before being put back in the server pool (a rough sketch follows after this list).
  • (Depending on traffic, and given that a restart now requires warmup) we probably don't want to restart as many servers at the same time.
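
A rough sketch of what such a warmup could look like, replaying a few representative requests against the local HHVM instance before repooling (the URLs, Host header, and port below are placeholders; this is not an existing tool):

```python
# Rough sketch of the warmup idea: before repooling, replay a few
# representative requests against the local HHVM instance so the hottest APC
# keys are populated first. The URLs, Host header and port are placeholders;
# this is not an existing tool.
import requests

LOCAL = "http://127.0.0.1:80"
WARMUP_PATHS = [  # a real list would come from sampled production traffic
    "/wiki/Main_Page",
    "/w/api.php?action=query&meta=siteinfo&format=json",
    "/w/load.php?modules=startup&only=scripts",
]


def warm_up(host_header="en.wikipedia.org", rounds=3):
    for _ in range(rounds):
        for path in WARMUP_PATHS:
            try:
                requests.get(LOCAL + path,
                             headers={"Host": host_header},
                             timeout=10)
            except requests.RequestException:
                pass  # warmup is best-effort; do not block the repool on errors


if __name__ == "__main__":
    warm_up()
    # ...then repool the server (e.g. via conftool), which is outside this sketch
```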

Anyway, this isn't a new concern, nor a major problem for us right now.

I do still wonder, though, why the overall memory consumption of APC (as measured from within HHVM, not by the OS) went up so steeply. I'd expect it to drop to zero, not increase. This might be a bug; filed T168540 for it.