
Ensure Changeprop is disabled when the databases are in read only mode
Closed, Resolved (Public)

Description

During a recent incident we made the database servers read-only, and it was suggested that when doing this we should also ensure that Changeprop is disabled. At a minimum, we should update the relevant wikis and documentation to ensure Changeprop is disabled in these instances, or potentially add some functionality to dbctl that could disable Changeprop when switching databases to read-only.

Event Timeline

To be clear, the idea came out of the fact that during the read-only period we had a lot of jobs failing, but given that we retry the jobs, we should not actually need to disable Changeprop.

According to https://github.com/wikimedia/change-propagation/blob/16c24306d9e6546ea240f190b38aef28ef33339f/lib/retry_executor.js#L66-L71 and the code below it, we do retry several times later when the database is in a read-only state. As far as I can tell, it adds an extra delay of between 30 seconds and 1 minute for each retry, on top of the default retry schedule, which retries after 1 and 6 minutes.

So I think we should mainly document that if a read-only phase lasts more than 10 minutes, we might be better off turning changeprop-jobqueue off at that point.

I'd ask @Pchelolo to confirm I'm reading the code right, though :)

yeah, that's correct. We can increase the additional delay if needed. Also, this particular additional delay is applied after a retry and before the next retry, so it looks like this:

<job> - 1 min - <retry> - 30-60s - 6 min - <retry>

I can easily shift this additional delay around, make it larger, etc. Maybe we should apply the additional delay before executing a retry instead of after; that way we only delay when the retry is already in the queue, so there is less chance of losing it.
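For reference, the timeline above can be worked through with a small sketch. This is not the change-propagation code: the constants are the figures quoted in this thread (retries after 1 and 6 minutes, plus 30-60s of extra delay between retries), and the function names `retry_windows` and `survives_read_only` are hypothetical helpers made up for illustration.

```python
from typing import List, Sequence, Tuple

# Figures quoted in this thread; assumptions, not read from retry_executor.js.
BASE_DELAYS_S = (60, 360)        # default retry delays: 1 min, then 6 min
EXTRA_DELAY_RANGE_S = (30, 60)   # extra read-only delay between retries


def retry_windows(
    base_delays: Sequence[int] = BASE_DELAYS_S,
    extra: Tuple[int, int] = EXTRA_DELAY_RANGE_S,
) -> List[Tuple[int, int]]:
    """Return (earliest, latest) seconds after the first attempt for each retry.

    Models: <job> - 1 min - <retry> - 30-60s - 6 min - <retry>
    The extra delay is added after a retry, before the next base delay.
    """
    windows = []
    earliest = latest = 0
    for i, delay in enumerate(base_delays):
        if i > 0:
            # Extra delay only applies between retries, not before the first.
            earliest += extra[0]
            latest += extra[1]
        earliest += delay
        latest += delay
        windows.append((earliest, latest))
    return windows


def survives_read_only(read_only_s: int) -> bool:
    """True if the last retry is guaranteed to run after read-only ends.

    Worst case assumed: the job's first attempt happens right as the
    read-only period starts.
    """
    last_earliest, _ = retry_windows()[-1]
    return read_only_s <= last_earliest


if __name__ == "__main__":
    print(retry_windows())          # [(60, 60), (450, 480)]
    print(survives_read_only(300))  # True
    print(survives_read_only(600))  # False
```

Under these assumptions the last retry lands 7.5 to 8 minutes after the first attempt, which is roughly in line with the 10-minute rule of thumb suggested earlier in the thread.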

Joe claimed this task.

Nothing actionable left on this task.