
LoadBalancer::waitForAll ignores critical group-load-only servers like 'recentchanges' slaves
Closed, Declined (Public)

Description

See https://phabricator.wikimedia.org/T116425#2523704

The workaround for now is to give them a tiny main group load (e.g. 1).
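
For concreteness, here is a minimal sketch of that workaround in an LBFactoryMulti-style configuration. The key names ('sectionLoads', 'groupLoadsBySection') follow LBFactoryMulti, but the hosts and weights are invented for illustration, not the real production values:

$wgLBFactoryConf['sectionLoads'] = [
    's1' => [
        'db1052' => 0,   # master
        'db1083' => 500, # main traffic
        'db1024' => 1,   # workaround: tiny main load so waitForAll() includes it
    ],
];
$wgLBFactoryConf['groupLoadsBySection'] = [
    's1' => [
        'recentchanges' => [
            'db1024' => 100, # with main load 0, waitForAll() used to skip this slave
        ],
    ],
];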

There should be a cleaner way to configure this. Perhaps LoadBalancer could take an array of group names that exempt a server from the wait checks (unless it has a main group load > 0). That could handle vslow/dump; all the rest would then be checked.
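
A hypothetical sketch of that proposal follows; neither the property nor the helper below exists in LoadBalancer, and getServerGroups() is an invented accessor, so this is only the shape of the idea:

# Groups whose servers are exempt from waitForAll() checks.
private $waitExemptGroups = [ 'vslow', 'dump' ];

# A server is waited on if it has main group load > 0, or if it belongs
# to at least one non-exempt group (e.g. 'recentchanges').
private function serverNeedsWaitCheck( $i ) {
    $info = $this->getServerInfo( $i );
    if ( !empty( $info['load'] ) ) {
        return true; # has main group load
    }
    foreach ( $this->getServerGroups( $i ) as $group ) {
        if ( !in_array( $group, $this->waitExemptGroups, true ) ) {
            return true; # group-only server in a critical group
        }
    }
    return false; # only in exempt groups (vslow/dump); skip it
}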

Event Timeline

aaron raised the priority of this task from Medium to High. Aug 4 2016, 8:40 PM
aaron updated the task description.

Change 303139 had a related patch set uploaded (by Jcrespo):
Apply workaround for T142135

https://gerrit.wikimedia.org/r/303139

I am applying https://gerrit.wikimedia.org/r/303139. Despite my description, I do not think it is a hacky solution. Marking a server with => 0 to mean "it should be known but not used under normal circumstances", and with => 1+ to mean "it should be kept in sync and used", is not a bad idea.

I think that if it is documented properly, it is OK.
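
For example, the convention could be spelled out inline in the configuration file itself; an illustrative comment, not the actual wording used in wmf-config:

# Weight semantics for main ('sectionLoads') entries:
#   'db10XX' => 0,  # known and kept in sync, but not used under normal circumstances
#   'db10XX' => 1,  # kept in sync, used (minimally), and waited on by waitForAll()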

The reason this happened is that we now have very powerful new servers, and we want to use those as much as possible because they have far lower latency, even under load; so I left the special ones with 0 weight because they were only used for the special roles, for separation (availability, in case those roles become saturated; performance, for buffer pool usage).

If it is documented properly in the configuration file, I do not think it is a hackish solution at all (it just broke my expectations). The only reason this wouldn't work is if something didn't work on the special slaves (for example, due to the special partitioning) and only worked on the rest, which as far as I know is not the case.

*If* we are going to change how the configuration file works, I would do it differently. It is already incredibly complex, with the same node appearing multiple times. It is very easy to fail to depool a node completely by taking it out of only the "main" section and not the "roles" section. I think a node should appear only once, with its roles set explicitly; let me give you a fake example:

'nodes' => [
    'db1024' => [
        's1' => [
            'rc_weight' => 100,
        ],
    ],
#    'db1026' => [  # fully depooled by commenting out a single block; harder to make an error
#        's1' => [
#            'rc_weight' => 100,
#        ],
#        's2' => [
#            'rc_weight' => 100,
#        ],
#    ],
    'db1076' => [
        's1' => [
            'wait_for_lag' => 0,
            'dump_weight' => 1,
            'vslow_weight' => 1,
        ],
    ],
    'db1083' => [
        's1' => [
            'main_weight' => 500,
        ],
    ],
]

This may sound silly, but remember that I probably average 1 commit/day to this file; anything that helps avoid errors will be helpful. Additionally, if in the future this file gets generated automatically, this format will be easier to translate.
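
As a sketch of that translation, here is a hypothetical converter from the proposed per-node format to the current LBFactoryMulti-style arrays. The key convention ('main_weight', 'rc_weight', ...) comes from the fake example above, and the group names it derives are only illustrative:

# Expand the single-appearance node format into per-section main loads
# and per-section group loads.
function expandNodes( array $nodes ) {
    $sectionLoads = [];
    $groupLoads = [];
    foreach ( $nodes as $host => $sections ) {
        foreach ( $sections as $section => $conf ) {
            $sectionLoads[$section][$host] =
                isset( $conf['main_weight'] ) ? $conf['main_weight'] : 0;
            foreach ( $conf as $key => $value ) {
                if ( substr( $key, -7 ) === '_weight' && $key !== 'main_weight' ) {
                    $group = substr( $key, 0, -7 ); # e.g. 'rc', 'dump', 'vslow'
                    $groupLoads[$section][$group][$host] = $value;
                }
            }
        }
    }
    return [ $sectionLoads, $groupLoads ];
}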

This part, of course, is outside the scope of this ticket; I frankly believe that documenting the weight semantics will be enough.

Mentioned in SAL [2016-08-05T11:03:35Z] <jynus@tin> Synchronized wmf-config/db-codfw.php: Document T142135 (duration: 00m 56s)

Mentioned in SAL [2016-08-05T11:06:18Z] <jynus@tin> Synchronized wmf-config/db-eqiad.php: Document T142135 and apply workaround (duration: 00m 52s)

So now that https://gerrit.wikimedia.org/r/303139 has been deployed, assuming there is no regression because of the weight => 1 and lag issues are effectively reduced, I will close this ticket as resolved.

Then, if my previous comment makes any sense, I will open a different ticket for that as a low-priority feature request.

Of course, I am open to suggestions regarding both the solution to this issue and the proposed configuration format.

I see the config docs were updated. I suppose that is sufficient.