Discover and fix under-utilized replicas
Open, Medium, Public

Authored By

	Ladsgroup
	Thu, Jun 6, 9:39 PM

Description

I wrote a script (P64213) that compares the pooled replicas of a section on a given metric. I picked mysql_global_status_innodb_data_read for now, but making it work for any other metric is easy.

Here is the result:

section	dc	graph
s7	codfw
s8	codfw
s2	codfw
s1	codfw
s6	codfw
s3	codfw
s5	codfw
s4	codfw
s5	eqiad
s2	eqiad
s7	eqiad
s8	eqiad
s6	eqiad
s3	eqiad
s4	eqiad
s1	eqiad

There are some really large spikes (due to maintenance), so I had to throw away any data points that are quite different from the median of the values for that replica. That's why the graphs sometimes look cut off or incomplete.
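The spike filtering described above could look roughly like the sketch below. It is not the code from P64213: the function name `drop_spikes`, the `(timestamp, value)` input shape, and the `max_ratio` threshold are all assumptions made for illustration.

```python
from statistics import median


def drop_spikes(points, max_ratio=5.0):
    """Drop samples that are far away from the replica's own median.

    points is a list of (timestamp, value) pairs for one replica.
    max_ratio is a hypothetical threshold, not a value taken from P64213.
    """
    values = [value for _, value in points]
    if not values:
        return []
    med = median(values)
    if med <= 0:
        # A ratio test is meaningless around zero, so keep everything.
        return list(points)
    low, high = med / max_ratio, med * max_ratio
    # Dropped samples leave gaps, which is why a graph can look cut off.
    return [(ts, value) for ts, value in points if low <= value <= high]
```

Because dropped samples are not replaced, a replica that spent a long time in maintenance ends up with visible gaps in its line.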

Also in some cases, the under-utilization is intentional (vslow, dump, etc.) but in some cases it's probably not.

This helps us distribute the load better and makes T360930: Section-wide circuit breaking more effective.

Related Objects

Mentioned Here: P64213 (An Untitled Masterwork)
T360930: Section-wide circuit breaking

Event Timeline

Ladsgroup created this task. Thu, Jun 6, 9:39 PM

Restricted Application added a subscriber: Aklapper. Thu, Jun 6, 9:39 PM

This could also be monitored (i.e. trend deviation from the "pack")

ABran-WMF triaged this task as Medium priority. Fri, Jun 7, 7:00 AM

ABran-WMF moved this task from Triage to In progress on the DBA board.

In T366852#9870054, @ABran-WMF wrote:

This could also be monitored (i.e. trend deviation from the "pack")

Yeah indeed.

I started a list of cases that look like they are behind the pack:

db2152	s8 - codfw
db2207	s2 - codfw
db2173	s1 - codfw
db2157	s5 - codfw
db1182 and db1246	s2 - eqiad
db1227 and db1170	s7 - eqiad
db1184 and db1206	s1 - eqiad

Be careful when increasing weights. Some of those hosts got those new weights because they got overloaded with "normal" weights for some reason.

In T366852#9876948, @Marostegui wrote:

Be careful when increasing weights. Some of those hosts got those new weights because they got overloaded with "normal" weights for some reason.

Yeah, I know. I check many aspects and move very slowly.

Sometimes it can also be that the previous host had worse hardware or other issues, and after a hardware refresh that's no longer a problem.

So I used FFT (fast Fourier transform, an algorithm for computing the DFT) to remove the noise and move the information from the time domain to the frequency domain.
The result is stuff like this:

But it's not really readable or comparable. So I sum them up (after removing noise) and turn them into a bar chart. Here is the result:

s6_codfw_ftt_real_sum.png
s4_codfw_ftt_real_sum.png
s5_codfw_ftt_real_sum.png
s7_codfw_ftt_real_sum.png
s8_codfw_ftt_real_sum.png
s7_eqiad_ftt_real_sum.png
s6_eqiad_ftt_real_sum.png
s8_eqiad_ftt_real_sum.png
s1_eqiad_ftt_real_sum.png
DEFAULT_eqiad_ftt_real_sum.png
s2_eqiad_ftt_real_sum.png
s4_eqiad_ftt_real_sum.png
s5_eqiad_ftt_real_sum.png

In most cases the results match "behind the pack". Also note that the data for these graphs is from a different day, to make sure one-off maintenance won't affect the numbers.
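For readers who want to reproduce the bar charts, here is a minimal sketch of the transform-and-sum idea using NumPy. It is a guess at the approach, not the actual script: "removing noise" is assumed to mean zeroing every frequency bin above a cutoff, the score sums magnitudes (the file names suggest the original may sum real parts instead), and `fft_score`, `rank_replicas` and `keep` are hypothetical names.

```python
import numpy as np


def fft_score(values, keep=10):
    """Collapse one replica's time series into a single comparable number.

    keep is a hypothetical cutoff: bins at index keep and above are treated
    as noise and zeroed before the remaining magnitudes are summed.
    """
    samples = np.asarray(values, dtype=float)
    if samples.size == 0:
        raise ValueError("empty series")
    spectrum = np.fft.rfft(samples)
    spectrum[keep:] = 0  # crude noise removal (assumption)
    # Dividing by the length makes bin 0 equal to the mean of the series.
    return float(np.abs(spectrum).sum() / samples.size)


def rank_replicas(series_by_host, keep=10):
    """Return (host, score) pairs, lowest score (least utilized) first."""
    scores = {host: fft_score(v, keep) for host, v in series_by_host.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Bin 0 carries the mean of the series, so for a fairly flat metric the score is dominated by the average read rate, which is probably why the bars line up with the "behind the pack" list.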

OTOH, summing amplitudes is idiotic from a mathematical point of view, so any ideas for avoiding that while still making this work would be appreciated. Maybe an inverse FT after removing the noise?
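One way the inverse-FT idea could be tried is a plain low-pass filter: transform, zero the high-frequency bins, and transform back, so each replica gets a smoothed series in the original units that can be compared directly. This is only a sketch under that assumption; `lowpass` and `keep` are hypothetical names, and the right cutoff would need to be chosen by looking at real data.

```python
import numpy as np


def lowpass(values, keep=10):
    """Smooth a series by dropping every frequency bin at index keep or above.

    The output has the same length and units as the input, so replicas can
    be plotted or compared point by point instead of summing amplitudes.
    """
    samples = np.asarray(values, dtype=float)
    spectrum = np.fft.rfft(samples)
    spectrum[keep:] = 0  # assumed definition of "noise": high frequencies
    # Passing n restores the original length for odd-sized inputs too.
    return np.fft.irfft(spectrum, n=samples.size)
```

Note that the mean of the smoothed series equals the mean of the input (bin 0 is untouched), so the benefit over a simple average is in seeing when a replica falls behind, not in the overall level.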

I also looked at the 10-day rolling average with spikes removed. These graphs show the replicas that are behind the pack:

s4_codfw_10d_filtered.png
s8_codfw_10d_filtered.png
s7_eqiad_10d_filtered.png
s5_eqiad_10d_filtered.png
DEFAULT_eqiad_10d_filtered.png

s4 codfw	db2172 - db2206
s8 codfw	db2152
s7 eqiad	db1227
s5 eqiad	db1230
s3 eqiad	db1223
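The rolling-average comparison could also be automated along these lines, which ties in with the monitoring suggestion earlier in the thread. This is a sketch only: `rolling_mean`, `behind_the_pack`, and the `threshold` of half the pack median are assumptions, and the window is given in samples, so a 10-day window depends on the scrape interval.

```python
from collections import deque
from statistics import median


def rolling_mean(values, window):
    """Trailing rolling mean; early points average over what is available."""
    out = []
    buf = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(buf) == window:
            total -= buf[0]  # this sample is evicted by the append below
        buf.append(value)
        total += value
        out.append(total / len(buf))
    return out


def behind_the_pack(series_by_host, window, threshold=0.5):
    """List hosts whose latest rolling mean is below threshold * pack median.

    threshold is a hypothetical cutoff. Intentionally under-weighted hosts
    (vslow, dump, etc.) would need to be excluded by the caller.
    """
    latest = {
        host: rolling_mean(values, window)[-1]
        for host, values in series_by_host.items()
        if values
    }
    if not latest:
        return []
    pack = median(latest.values())
    return sorted(host for host, avg in latest.items() if avg < threshold * pack)
```

Using the median of the section as the reference keeps one very busy or very idle replica from shifting the baseline for everyone else.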

	F56241514: DEFAULT_eqiad_10d_filtered.png
	Fri, Jul 5, 5:24 PM
