
Regression: Backend Save Timing raised by about 50%
Closed, Declined · Public

Assigned To
None
Authored By
Krinkle
Jun 14 2017, 1:54 PM
Referenced Files
F8467672: Capture d'écran 2017-06-16 15.01.38.png
Jun 16 2017, 1:02 PM
F8467653: graph.png
Jun 16 2017, 1:02 PM
F8467647: Capture d'écran 2017-06-16 14.55.34.png
Jun 16 2017, 1:02 PM
F8467655: Capture d'écran 2017-06-16 14.58.20.png
Jun 16 2017, 1:02 PM
F8459686: Screen Shot 2017-06-14 at 15.34.43.png
Jun 14 2017, 2:40 PM
F8459687: Screen Shot 2017-06-14 at 15.34.57.png
Jun 14 2017, 2:40 PM
F8459468: Screen Shot 2017-06-14 at 15.00.49.png
Jun 14 2017, 2:09 PM
F8459444: Screen Shot 2017-06-14 at 14.51.27.png
Jun 14 2017, 1:54 PM

Description

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1496894179569&to=1497166735726

Screen Shot 2017-06-14 at 14.51.21.png (978×2 px, 250 KB)

Screen Shot 2017-06-14 at 14.51.27.png (988×2 px, 243 KB)

metric    before    after
lower     80ms      80ms
p50       320ms     500ms
p75       470ms     600ms
p95       1.3s      3s
p99       15s       30s

Started around June 8 23:00 or June 9 10:00 (UTC). Probably caused by wmf.4.

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Jun 14 2017, 1:54 PM

This graph is incomplete, it's gone back down significantly now. Was that branch reverted?

> This graph is incomplete, it's gone back down significantly now. Was that branch reverted?

No, it wasn't reverted. But yeah, it went back down Monday morning. Not all the way, but most of it.

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1496894179569&to=1497398400000&refresh=5m&orgId=1

Screen Shot 2017-06-14 at 15.00.49.png (984×2 px, 313 KB)

Server Admin Log entries from when the regression started:

2017-06-08

  • 23:37 demon@tin: rebuilt wikiversions.php and synchronized wikiversions files: remaining wikis to wmf.4

2017-06-09

  • 00:15 mutante: mw1275 depooled (T124956)
  • 00:36 ejegg: disabled banner impressions loader
  • 02:19 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.4) (duration: 06m 04s)
  • 02:25 l10nupdate@tin: ResourceLoader cache refresh completed at Fri Jun 9 02:25:29 UTC 2017 (duration 6m 27s)
  • 05:51 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1074 - T166205 (duration: 00m 42s)
  • 05:57 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1056 - T166206 (duration: 00m 41s)
  • 06:41 moritzm: updating mw1161 to HHVM 3.18
  • 06:58 moritzm: updating mw117* to HHVM 3.18+wmf5
  • 07:40 moritzm: upgrade app servers in codfw running HHVM 3.18 to +wmf5
  • 19:46 RainbowSprinkles: mw1299: running scap pull, maybe out of date?
  • 20:12 demon@tin: Synchronized php-1.30.0-wmf.4/extensions/CirrusSearch/includes/Job/DeleteArchive.php: Really fix it this time (duration: 00m 43s)

And from when it ended:

2017-06-12

  • 06:53 moritzm: upgrade remaining app servers running HHVM 3.18 to 3.18.2+wmf5
  • 07:19 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1064 - T166206 (duration: 00m 41s)
  • 07:31 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1060 - T166205 (duration: 00m 41s)
  • 07:31 marostegui: Deploy alter table s2 - db1060 - T166205
  • 07:38 marostegui: Reboot ms-be1008 as xfs is failing
  • 08:31 godog: reboot ms-be1002, load avg slowly creeping up
  • 09:13 moritzm: upgrading mw1236-mw1249 to HHVM 3.18
  • 09:55 moritzm: upgrading mw1221-mw1235 to HHVM 3.18
  • 10:28 moritzm: upgrading mw1250-mw1258 to HHVM 3.18
  • 11:05 moritzm: upgrading job runners mw1162-mw1164 to HHVM 3.18
  • 12:01 moritzm: upgrading mw1266-mw1275 to HHVM 3.18

There seems to be DB work going on, with those DB servers being repooled and depooled. @jcrespo, any DBA work happening over that period that could explain this?

No, I wasn't even working at those times, and as I had 4 days of holidays, I didn't leave anything running even 24 hours beforehand.

The DB servers that were repooled when it started are not the same ones that were depooled when it ended. It was just regular DB maintenance (alter tables going on).

And the times don't really match any of the start/end spikes, I think.

It doesn't really seem to correlate; the first big spike for Save Timing is 8 hours after the start of the sharp rise on the graph you've just linked to.

That is the only issue I am aware of regarding databases; pools and depools happen every single day, so I do not see why they would affect this, especially since we never touch the masters or the ES servers. In the past, I asked ori whether faster ES servers affected performance, and I was told the main blockers are some software triggers and dependencies, with the database having close to no impact on save timings (it is a very small part of the save process).

There were some HHVM upgrades at the time; maybe you can ask about those, but I would honestly focus on metric-driven investigation.
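
For reference, that kind of metric-driven comparison can be scripted against the Graphite render API behind the dashboard; a minimal sketch, assuming a series name like the placeholder below (the metric path and date ranges are illustrative, not the real ones):

```python
# Minimal sketch: compare the mean of a save-timing percentile series before
# and after the deploy window, using the Graphite render API. The metric path
# and time ranges are placeholders, not the real series names.
import statistics

import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "MediaWiki.timing.saveTiming.p75"  # hypothetical path; adjust to the real series


def mean_over(metric, start, end):
    """Average the non-null datapoints of one series between start and end."""
    resp = requests.get(GRAPHITE, params={
        "target": metric,
        "from": start,    # Graphite absolute dates, e.g. "20170606"
        "until": end,
        "format": "json",
    }, timeout=30)
    resp.raise_for_status()
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    return statistics.mean(points)


before = mean_over(METRIC, "20170606", "20170608")
after = mean_over(METRIC, "20170609", "20170611")
print(f"before: {before:.0f} ms  after: {after:.0f} ms  ratio: {after / before:.2f}")
```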

Assuming each of the HHVM upgrades implicitly clears the APC memory on that server, that could be a potential cause as well (e.g. increased save timing latency due to internal cache misses while the MediaWiki server slowly re-populates its APC keys).
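
A minimal sketch of that mechanism, purely illustrative (the key name and the recompute cost are made up; this is not MediaWiki's actual cache code):

```python
# Illustrative only: with APC wiped by a restart, every lookup falls through
# to the slow recompute path until the cache repopulates, which shows up as
# higher save timing. The key and the 50 ms cost are made up.
import time

apc = {}  # stands in for the per-server APC store, emptied on HHVM restart


def get_with_cache(key, compute):
    if key in apc:           # warm server: cheap hit
        return apc[key]
    value = compute()        # cold server: pay the full recompute cost
    apc[key] = value
    return value


def expensive_lookup():
    time.sleep(0.05)         # pretend this is config/sites/interwiki loading
    return "value"


t0 = time.time()
get_with_cache("example-key", expensive_lookup)   # miss: ~50 ms
print(f"cold: {(time.time() - t0) * 1000:.0f} ms")

t0 = time.time()
get_with_cache("example-key", expensive_lookup)   # hit: ~0 ms
print(f"warm: {(time.time() - t0) * 1000:.0f} ms")
```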

And indeed, there does appear to be a noticeable change in APC memory size between June 9 and June 12. However, instead of a drop (as I'd expected), there is actually a significant increase in APC memory usage.

T158176: Build / migrate to HHVM 3.18

I don't think this is related to HHVM 3.18; about half of the app servers and API servers in eqiad had been running 3.18 since May already, so we'd have noticed the effect much earlier than that. Most of the HHVM updates logged in SAL were from one 3.18 build to the next (since we found quite a few upstream bugs). Updates from 3.12 to 3.18 were logged as "updating to 3.18"; the others usually specify the updated releases.

The change from +wmf4 to +wmf5 is limited to this upstream patch: https://github.com/facebook/hhvm/commit/22e7acdb22b3d14420f82c7f59c843bd4e6cafcd
That code path is triggered quite rarely (about 1-3 times per day on API servers, 3-5 times per day on job runners, and less than once a day on standard app servers) and would not explain any substantial load change.

The incident seems to be over (we'll probably want to wait a bit more before claiming that for certain):

Capture d'écran 2017-06-16 14.55.34.png (537×924 px, 96 KB)

graph.png (415×747 px, 41 KB)

APC usage is still high; we might want to turn that into a separate task:

Capture d'écran 2017-06-16 14.58.20.png (981×1 px, 126 KB)

Although there was a similar spike last fall already:

Capture d'écran 2017-06-16 15.01.38.png (983×1 px, 184 KB)

As far as I can tell this is over and was correlated with general load. We'll reopen if it reoccurs.

> I don't think this is related to HHVM 3.18; about half of the app servers and API servers in eqiad had been running 3.18 since May already

To clarify, I wasn't suggesting the upgraded HHVM runtime would be slower in any way. Rather, I was suggesting the upgrade itself caused a measurable (but, logically, temporary) slowdown due to the APC cache naturally being wiped when HHVM is restarted.

For each upgrade, latency will be higher at first. And when lots of servers are restarted (upgraded) at the same time, this affects more users, and thus our metrics. This is only natural and unavoidable, but it does bring up two points:

  • A server should probably be warmed up using local requests before being put back in the server pool (a rough sketch follows after this list).
  • (Depending on traffic, and given that a restart now requires warmup) we probably don't want to restart as many servers at the same time.
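
A rough sketch of what such a warmup could look like, replaying a few representative requests against the local HHVM instance before repooling (the URLs, Host header, and port below are placeholders; this is not an existing tool):

```python
# Rough sketch of the warmup idea: before repooling, replay a few
# representative requests against the local HHVM instance so the hottest APC
# keys are populated first. The URLs, Host header and port are placeholders;
# this is not an existing tool.
import requests

LOCAL = "http://127.0.0.1:80"
WARMUP_PATHS = [  # a real list would come from sampled production traffic
    "/wiki/Main_Page",
    "/w/api.php?action=query&meta=siteinfo&format=json",
    "/w/load.php?modules=startup&only=scripts",
]


def warm_up(host_header="en.wikipedia.org", rounds=3):
    for _ in range(rounds):
        for path in WARMUP_PATHS:
            try:
                requests.get(LOCAL + path,
                             headers={"Host": host_header},
                             timeout=10)
            except requests.RequestException:
                pass  # warmup is best-effort; do not block the repool on errors


if __name__ == "__main__":
    warm_up()
    # ...then repool the server (e.g. via conftool), which is outside this sketch
```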

Anyway, this isn't a new concern, nor a major problem for us right now.

I do still wonder, though, why the overall memory consumption of APC (as measured from within HHVM, not by the OS) went up so steeply. I'd expect it to drop to zero, not increase. This might be a bug; filed T168540 for it.