WANObjectCache is a mammoth of a class, with complexity so high that barely anyone can understand or contribute to it. It is also extremely critical, so bugs in it can lead to major issues.
A big portion of this complexity comes from keeping track of replication lag and reducing the TTL of keys if the lag is too high. I believe this should be removed because:
- High replag used to be a much bigger problem when databases were on HDDs, but now that we use SSDs, replag has been around 0.2s or below almost all the time.
- MediaWiki automatically avoids reading from replicas that have high lag, which in practice depools them.
- Because of the above, the only case where these code paths get triggered is when all replicas are lagged, in which case it's an incident and we have bigger problems than some stale data.
- During such an incident, it could make things worse by reducing TTLs and forcing more DB reads.
- It makes the WAN cache code unpredictable and harder to test under such conditions.
- There has always been the assumption that data in the WAN cache is stale, at least for a bit. So I fail to see how this fixes an actual problem that hurts users or the integrity of the data.
- It couples the cache infrastructure to the database infrastructure, breaking encapsulation between two pretty large and complex components of MediaWiki and the infrastructure.
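
To make the incident scenario above concrete, here is a rough sketch in Python of what lag-based TTL reduction does to a hot key. This is not the actual WANObjectCache code: the function names are hypothetical, the 7s lag threshold and 30s reduced TTL are assumed values picked for illustration, and the model ignores any stampede protection.

```python
def effective_ttl(requested_ttl, replica_lag, max_lag=7.0, lagged_ttl=30):
    """Return the TTL a value would be stored with.

    Hypothetical model: when the replica the value was read from is
    lagged beyond max_lag seconds, cap the TTL at lagged_ttl seconds.
    Both defaults are assumptions for illustration only.
    """
    if replica_lag > max_lag:
        return min(requested_ttl, lagged_ttl)
    return requested_ttl


def regenerations_per_hour(ttl):
    """Rough upper bound on DB-backed regenerations of one hot key per hour.

    Assumes the key is requested constantly, so it is regenerated once
    every time it expires.
    """
    return 3600 / ttl


# Normal operation: lag around 0.2s, the requested TTL is kept.
normal_ttl = effective_ttl(3600, replica_lag=0.2)

# Incident: every replica is lagged, so the TTL is cut down.
incident_ttl = effective_ttl(3600, replica_lag=12.0)

print(normal_ttl, regenerations_per_hour(normal_ttl))
print(incident_ttl, regenerations_per_hour(incident_ttl))
```

With these assumed numbers, a key normally cached for an hour would be regenerated up to 120 times an hour once every replica is over the threshold, which is exactly when the databases can least afford extra reads.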
The lag handling is such a big part of the class that I think I have to remove it gradually, in multiple patches. Every time I have started the cleanup, it sprawled out of control.
