pt-heartbeat uses super-user privileges to write to the database, even when it is in read-only mode. This works well for maintaining symmetry between eqiad and codfw (each datacenter sends heartbeat events everywhere else). After running this way for over a year, though, it may not be the best model. It has advantages: a SPOF is unlikely, it allows dc <-> dc link checking, and it makes failovers easy. But super-user mode has issues: it does not respect read-only, it requires a root account, and it makes database master failovers more complex, because the setup depends on Puppet while masters are actually controlled from MediaWiki.
One proposed model change is to move the pt-heartbeat client off the master and duplicate it to avoid a SPOF, running it from 2 separate places pointing to the deployed MediaWiki master (as exposed, for example, by https://noc.wikimedia.org/db.php?format=json) and writing only to the real master, which would then switch automatically regardless of the Puppet or MediaWiki state. The above config source will be changed to etcd when it is ready. The heartbeat table would keep the same structure (maybe the dc and other fields are no longer needed?); only the method of writing it would change.
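A rough sketch of the writer side of this proposal, to make the "read config, write only to the configured master" idea concrete. The JSON shape, the `s1` section name, the hostnames, and the heartbeat table columns here are all assumptions for illustration, not the real db.php payload or pt-heartbeat schema:

```python
from datetime import datetime, timezone

# Hypothetical shape of the parsed db.php?format=json payload; the real
# structure may differ. Convention assumed: the first host listed in a
# section is its master.
EXAMPLE_CONFIG = {
    "sectionLoads": {
        "s1": {"db1067": 0, "db1080": 100},
    },
}


def pick_master(config: dict, section: str) -> str:
    """Return the host MediaWiki currently treats as master for a section."""
    loads = config["sectionLoads"][section]
    return next(iter(loads))  # dicts preserve insertion order (Python 3.7+)


def heartbeat_update_sql(shard: str) -> str:
    """Build the UPDATE refreshing this shard's heartbeat row
    (pt-heartbeat-like schema assumed)."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
    return (
        "UPDATE heartbeat.heartbeat "
        f"SET ts = '{ts}' WHERE shard = '{shard}'"
    )
```

The external daemon would re-fetch the config every few seconds, reconnect whenever `pick_master()` changes, and issue the UPDATE only against that host, so a master switch in MediaWiki (later: etcd) is picked up without any Puppet change.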
This is something that #mediawiki-platform-team and #performance-team should be aware of, but probably no action is needed from them, as the change should be transparent to the current application lag checking.
**So this started as an infrastructure-focused problem, but the more I think about it, the more I am inclined to increase the scope of the solution.**
GTID has become difficult to work with, and it is not the once-and-for-all solution we thought it might be.
One proposal would be to drop GTID support for replication checking and use a heartbeat-like solution instead (some people call this pseudo-GTID), integrating it into MediaWiki code so it is no longer a WMF-specific setup.
This approach would not be without problems (it would require a polling model, which has some disadvantages), but polling the database is already a problem in itself, as seen on T180918.
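To illustrate what a heartbeat-based (pseudo-GTID-style) replication check looks like: instead of asking "has replica X executed GTID set G?", you ask "has replica X applied a heartbeat written at or after time T?". A minimal sketch, with `get_heartbeat` as a hypothetical callable returning the last heartbeat timestamp applied on the replica:

```python
import time
from datetime import datetime, timedelta
from typing import Callable


def replica_reached(write_time: datetime, replica_heartbeat: datetime) -> bool:
    """A replica has 'reached' a write if the newest heartbeat it has
    applied was generated at or after that write. Heartbeats replicate as
    ordinary row changes, so only the writer's clock is involved."""
    return replica_heartbeat >= write_time


def wait_for_replica(get_heartbeat: Callable[[], datetime],
                     write_time: datetime,
                     timeout: float = 5.0,
                     interval: float = 0.1) -> bool:
    """The polling model mentioned above: re-read the replica's heartbeat
    until it reaches write_time, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if replica_reached(write_time, get_heartbeat()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The polling disadvantage is visible here: precision is bounded by `interval`, and every waiter adds read load, which is why the heartbeat frequency and the caching of reads (below) matter.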
The fundamentals of the solution would be:
* Migrate pt-heartbeat to a MediaWiki script so it lives at the application layer (we can keep it at the infrastructure layer for non-MediaWiki services), e.g. a maintenance script run from the maintenance server, which reads the etcd configuration automatically and switches to the configured master based on MediaWiki config, not, as now, on Puppet config
* Migrate the chronology protector, lag checking and other replication-based checks to be heartbeat-based
* Increase the heartbeat frequency (0.1 s between updates? 0.5 s?)
* Coordinate a way to check (poll) the heartbeat that avoids cache issues (large transactions, overload, cache stampedes) and fails correctly on network and hardware issues. This would solve most of T180918
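The last point above could be sketched like this: the lag check should fail explicitly when the heartbeat cannot be trusted, rather than return a stale or cached value. The interface (`read_heartbeat_ts` callable, epoch timestamps) and the 10-second staleness bound are assumptions for illustration:

```python
import time

MAX_HEARTBEAT_AGE = 10.0  # seconds; tunable assumption


class LagUnknown(Exception):
    """Raised instead of returning a stale value when the heartbeat cannot
    be trusted (network or hardware failure, wedged replication)."""


def replica_lag(read_heartbeat_ts, now=None):
    """Return replica lag in seconds, or raise LagUnknown.

    read_heartbeat_ts: hypothetical callable returning the epoch timestamp
    of the last heartbeat applied on the replica."""
    now = time.time() if now is None else now
    try:
        ts = read_heartbeat_ts()
    except Exception as exc:
        raise LagUnknown("heartbeat read failed") from exc
    lag = now - ts
    if lag > MAX_HEARTBEAT_AGE:
        # Fail loudly rather than report a number nobody should trust;
        # callers must depool or retry instead of acting on stale data.
        raise LagUnknown(f"heartbeat too old: {lag:.1f}s")
    return lag
```

Centralizing reads behind one such function (with a short shared cache and a single in-flight refresh) is also where cache-stampede protection would live, since every web request checking lag independently is part of what T180918 describes.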