As part of trying to verify whether T218692 is fixed, I found the following unexpected messages in the logs:
2020-03-25T14:15:43 Database is read-only: The master database server is running in read-only mode.
…
2020-03-25T14:14:30 Server {host} has 15.94904589653 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.946383953094 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.943606853485 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.940775871277 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.938050985336 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.935436010361 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.932633876801 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.929592847824 seconds of lag (>= 6)
2020-03-25T14:14:30 LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
2020-03-25T14:14:22 LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
It's unexpected that, when the DB master is read-only, Rdbms claims we're read-only because of lagged replicas. However, given that the main write query did fail correctly with a "DB is read-only" message, I've marked T218692 as resolved.
The other messages looked like perhaps a regression from T216496, which was about MW wrongly reporting "lagged replica mode" when there was no master DB.
This is not the case either. It is not wrongly reporting "lagged replica mode". Rather, it is reporting actual replication lag values: as the log above shows, the code tried many different servers in codfw, and every one of them reported replication lag.
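To make the behaviour described above concrete, here is a minimal sketch of a reader-selection loop that skips any replica at or above a lag threshold and gives up when none qualifies. It is a simplified illustration only, not the actual `LoadBalancer::pickReaderIndex` code: the function name `pick_reader`, the `lag_by_host` mapping, and the first-match selection order are all assumptions made for this example. Only the threshold of 6 seconds and the message wording come from the log.

```python
from typing import Mapping, Optional

# Threshold taken from the "(>= 6)" suffix in the log lines above.
MAX_LAG_SECONDS = 6.0


def pick_reader(
    lag_by_host: Mapping[str, float],
    max_lag: float = MAX_LAG_SECONDS,
) -> Optional[str]:
    """Return the first host whose reported lag is below max_lag, else None.

    Hypothetical simplification: a real load balancer would also take
    server weights and connection failures into account.
    """
    for host, lag in lag_by_host.items():
        if lag >= max_lag:
            # Mirrors the per-server message seen in the log.
            print(f"Server {host} has {lag} seconds of lag (>= {max_lag:g})")
            continue
        return host
    # No replica qualified, so the caller would fall back to read-only mode.
    print("all replica DBs lagged. Switch to read-only mode")
    return None
```

With every replica reporting roughly 15.9 seconds, as in the log, this sketch returns `None` after one message per server, which matches the sequence of log entries seen here.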
However, @Marostegui confirmed there is no real lag: not between codfw-master and its replicas, and not relative to eqiad-master either. All are well under a second.
So the questions are:
- Where are these values coming from?
- Are they the result of a bug? If so, we need to fix it.
- Are they real but mean something else? Then we need to rephrase this message, and also make sure that it doesn't result in replica-hopping if it isn't real lag.
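
As a starting point for the first question, the reported values can be pulled out of the log text so they can be compared against independently measured lag. This is a small sketch that assumes the message format shown above; `LAG_RE` and `reported_lags` are names made up for this example.

```python
import re

# Matches "has <lag> seconds of lag (>= <limit>)" as it appears in the log.
LAG_RE = re.compile(
    r"has (?P<lag>\d+(?:\.\d+)?) seconds of lag \(>= (?P<limit>\d+(?:\.\d+)?)\)"
)


def reported_lags(log_text: str) -> list[tuple[float, float]]:
    """Return (reported lag, threshold) pairs in the order they appear."""
    return [
        (float(m["lag"]), float(m["limit"]))
        for m in LAG_RE.finditer(log_text)
    ]
```

Feeding it the log excerpt above yields one pair per "seconds of lag" entry, which makes it easier to line the numbers up with what the DBAs observe on the servers themselves.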