
Discover and fix under-utilized replicas
Open, Medium, Public

Assigned To
Authored By
Ladsgroup
Thu, Jun 6, 9:39 PM
Referenced Files
F56241514: DEFAULT_eqiad_10d_filtered.png
Fri, Jul 5, 5:24 PM
F56241512: s5_eqiad_10d_filtered.png
Fri, Jul 5, 5:24 PM
F56241510: s7_eqiad_10d_filtered.png
Fri, Jul 5, 5:24 PM
F56241508: s8_codfw_10d_filtered.png
Fri, Jul 5, 5:24 PM
F56241506: s4_codfw_10d_filtered.png
Fri, Jul 5, 5:24 PM
F55409636: s5_eqiad_ftt_real_sum.png
Mon, Jun 17, 4:28 PM
F55409634: s4_eqiad_ftt_real_sum.png
Mon, Jun 17, 4:28 PM
F55409628: s2_eqiad_ftt_real_sum.png
Mon, Jun 17, 4:28 PM

Description

I wrote a script (P64213) that compares the set of pooled replicas of a section on a given metric. I picked mysql_global_status_innodb_data_read for now, but making it work for any other metric is easy.

Here is the result:

section  dc     graph
s7       codfw  s7_codfw.png (1×2 px, 356 KB)
s8       codfw  s8_codfw.png (1×2 px, 279 KB)
s2       codfw  s2_codfw.png (1×2 px, 260 KB)
s1       codfw  s1_codfw.png (1×2 px, 240 KB)
s6       codfw  s6_codfw.png (1×2 px, 292 KB)
s3       codfw  DEFAULT_codfw.png (1×2 px, 304 KB)
s5       codfw  s5_codfw.png (1×2 px, 282 KB)
s4       codfw  s4_codfw.png (1×2 px, 463 KB)
s5       eqiad  s5_eqiad.png (1×2 px, 168 KB)
s2       eqiad  s2_eqiad.png (1×2 px, 335 KB)
s7       eqiad  s7_eqiad.png (1×2 px, 305 KB)
s8       eqiad  s8_eqiad.png (1×2 px, 219 KB)
s6       eqiad  s6_eqiad.png (1×2 px, 317 KB)
s3       eqiad  DEFAULT_eqiad.png (1×2 px, 216 KB)
s4       eqiad  s4_eqiad.png (1×2 px, 300 KB)
s1       eqiad  s1_eqiad.png (1×2 px, 259 KB)

There are some really large spikes (due to maintenance), so I had to throw away any data point that is quite different from the median of the values for that replica. That's why the graphs sometimes look cut off or incomplete.
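The spike removal described above could look roughly like the sketch below. This is not the code from P64213: the function name `drop_spikes` and the `max_ratio` threshold are hypothetical, and "quite different from the median" is assumed here to mean "more than a fixed factor away from it".

```python
from statistics import median


def drop_spikes(values, max_ratio=3.0):
    """Keep only samples within a factor of max_ratio of the series median.

    max_ratio is a made-up threshold for illustration, not taken from P64213.
    """
    med = median(values)
    if med == 0:
        # Ratios are meaningless around a zero median; keep everything.
        return list(values)
    return [v for v in values if med / max_ratio <= v <= med * max_ratio]


# Made-up samples for one replica, with one maintenance-like spike.
samples = [100, 110, 95, 5000, 105, 98]
filtered = drop_spikes(samples)
print(filtered)
```

Because rejected samples are dropped rather than replaced, the filtered series has gaps, which is consistent with graphs that look cut off.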

Also in some cases, the under-utilization is intentional (vslow, dump, etc.) but in some cases it's probably not.

This helps us distribute the load better and makes T360930: Section-wide circuit breaking more effective.

Event Timeline

This could also be monitored (i.e. trend deviation from the "pack")

ABran-WMF triaged this task as Medium priority. Fri, Jun 7, 7:00 AM
ABran-WMF moved this task from Triage to In progress on the DBA board.

> This could also be monitored (i.e. trend deviation from the "pack")

Yeah indeed.

I have started a list of cases that look like they are behind the pack:

db2152: s8 - codfw
db2207: s2 - codfw
db2173: s1 - codfw
db2157: s5 - codfw
db1182 and db1246: s2 - eqiad
db1227 and db1170: s7 - eqiad
db1184 and db1206: s1 - eqiad
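A list like the one above could also be produced automatically, which is roughly what monitoring "trend deviation from the pack" would need. The following is only a sketch under assumptions: the host names and numbers are made up, `behind_the_pack` is a hypothetical helper, and "behind" is assumed to mean "mean value below half of the median of the per-host means".

```python
from statistics import mean, median


def behind_the_pack(series_by_host, threshold=0.5):
    """Return hosts whose mean is below threshold * median of all host means.

    threshold=0.5 is an arbitrary illustration value.
    """
    means = {host: mean(vals) for host, vals in series_by_host.items()}
    pack = median(means.values())
    return sorted(h for h, m in means.items() if m < threshold * pack)


# Made-up data for one section; not real hosts or real measurements.
section = {
    "replica-a": [100, 110, 105],
    "replica-b": [95, 100, 102],
    "replica-c": [30, 35, 28],
    "replica-d": [98, 104, 99],
}
laggards = behind_the_pack(section)
print(laggards)
```

Replicas that are under-utilized on purpose (vslow, dump, etc.) would have to be excluded from the input or from the result before acting on it.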

Be careful when increasing weights. Some of those hosts got those new weights because they got overloaded with "normal" weights for some reason.

> Be careful when increasing weights. Some of those hosts got those new weights because they got overloaded with "normal" weights for some reason.

Yeah, I know. I check many aspects and move very slowly.

Sometimes it can also be that the previous host had worse hardware or other issues, but after a refresh that's not a problem anymore.

So I used FFT (a fast algorithm for computing the DFT) to remove the noise, moving the information from the time domain to the frequency domain.
The result is stuff like this:

s8_codfw_ftt_real.png (1×2 px, 61 KB)

But it's neither really readable nor comparable, so I summed the amplitudes (after removing noise) and turned them into a bar chart. Here is the result:

s6_codfw_ftt_real_sum.png (900×1 px, 26 KB)

s4_codfw_ftt_real_sum.png (900×1 px, 31 KB)

s5_codfw_ftt_real_sum.png (900×1 px, 27 KB)

s7_codfw_ftt_real_sum.png (900×1 px, 30 KB)

s8_codfw_ftt_real_sum.png (900×1 px, 28 KB)

s7_eqiad_ftt_real_sum.png (900×1 px, 30 KB)

s6_eqiad_ftt_real_sum.png (900×1 px, 28 KB)

s8_eqiad_ftt_real_sum.png (900×1 px, 28 KB)

s1_eqiad_ftt_real_sum.png (900×1 px, 31 KB)

DEFAULT_eqiad_ftt_real_sum.png (900×1 px, 25 KB)

s2_eqiad_ftt_real_sum.png (900×1 px, 30 KB)

s4_eqiad_ftt_real_sum.png (900×1 px, 24 KB)

s5_eqiad_ftt_real_sum.png (900×1 px, 23 KB)

In most cases the results match the "behind the pack" list. Also note that the data for these graphs is from a different day, to make sure one-off maintenance won't affect the numbers.

OTOH, summing amplitudes is idiotic from a mathematical point of view, so any idea for avoiding that while still making this work would be appreciated. Maybe an inverse FT on it after removal of noise?
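For what it's worth, the inverse-transform idea can be sketched with NumPy as below. This is an illustration, not what produced the graphs above: "noise" is assumed to mean everything above a cutoff frequency, `low_pass` and `keep` are hypothetical names, and the two series are synthetic.

```python
import numpy as np


def low_pass(values, keep=5):
    """Zero all but the `keep` lowest-frequency bins, then transform back."""
    spectrum = np.fft.rfft(values)
    spectrum[keep:] = 0  # assumption: high-frequency bins are the noise
    return np.fft.irfft(spectrum, n=len(values))


# Synthetic data: 288 samples (e.g. one day at 5-minute resolution).
rng = np.random.default_rng(0)
t = np.arange(288)
daily = 20 * np.sin(2 * np.pi * t / 288)
busy = 100 + daily + rng.normal(0, 5, t.size)
idle = 40 + 0.4 * daily + rng.normal(0, 5, t.size)

smooth_busy = low_pass(busy)
smooth_idle = low_pass(idle)
print(smooth_busy.mean(), smooth_idle.mean())
```

The smoothed curves are back in the time domain and in the metric's own units, so replicas can be compared directly. Bin 0 of the transform is the sum of the samples, so keeping it preserves each replica's mean.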

I also looked at the 10-day rolling average with spikes removed. These graphs show replicas that are behind the pack:

s4_codfw_10d_filtered.png (1×2 px, 465 KB)

s8_codfw_10d_filtered.png (1×2 px, 479 KB)

s7_eqiad_10d_filtered.png (1×2 px, 509 KB)

s5_eqiad_10d_filtered.png (1×2 px, 387 KB)

DEFAULT_eqiad_10d_filtered.png (1×2 px, 434 KB)

s4 codfw: db2172 - db2206
s8 codfw: db2152
s7 eqiad: db1227
s5 eqiad: db1230
s3 eqiad: db1223
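The rolling average used for the graphs above can be sketched as a trailing window mean. This is a minimal illustration and not the script's actual code: `rolling_mean` is a hypothetical helper, and the window here is counted in samples rather than days.

```python
from collections import deque


def rolling_mean(values, window):
    """Trailing moving average.

    The first window-1 outputs average over fewer samples.
    """
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest sample is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Made-up values; with real data the window would cover 10 days of samples.
smoothed = rolling_mean([10, 20, 30, 40], window=2)
print(smoothed)
```

Running this on spike-filtered data per replica and plotting the curves together gives the kind of comparison shown in the filtered graphs.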