Since 07:10 AM UTC we've had around 1k errors per minute like:
Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
This is the trace:
@timestamp: 2020-08-03T14:22:54
@version: 1
_id: AXO0tDBqMQ_08tQaoG--
_index: logstash-mediawiki-2020.08.03
_score: -
_type: mediawiki
channel: DBReplication
facility: user
host: mw2139
http_method: GET
ip: 10.192.16.112
level: ERROR
logsource: mw2139
message: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
mwversion: 1.36.0-wmf.2
normalized_message: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
phpversion: 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1
program: mediawiki
referrer: -
reqId: 040e7bae-93fc-42fd-a980-57ce6e80d111
server: en.wikipedia.org
servergroup: api_appserver
severity: err
shard: s1
tags: input-kafka-rsyslog-udp-localhost, rsyslog-udp-localhost, kafka, es, es
timestamp: 2020-08-03T14:22:54+00:00
type: mediawiki
url: /w/api.php
wiki: enwiki
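For anyone unfamiliar with the error: the gist of the reader-selection logic is that a replica is only eligible to serve reads if its measured lag is below a configured threshold, and when no replica qualifies the load balancer falls back to read-only mode. A minimal Python sketch of that idea (not the actual MediaWiki PHP; the 6-second threshold and lag values are assumptions for illustration):

```python
# Simplified model of "pickReaderIndex": filter replicas by a max-lag
# threshold, then pick from the survivors. If none survive, the caller
# switches the wiki to read-only mode.

def pick_reader_index(lags, max_lag=6):
    """Return the index of a usable replica, or None if all are lagged.

    lags: {server_index: lag_seconds}
    max_lag: threshold in seconds (6 s assumed here for illustration)
    """
    usable = {i: lag for i, lag in lags.items() if lag < max_lag}
    if not usable:
        return None  # all replica DBs lagged -> read-only fallback
    # Prefer the least-lagged usable replica (the real code also
    # weights candidates by configured load).
    return min(usable, key=usable.get)

print(pick_reader_index({1: 30.0, 2: 45.0, 3: 12.0}))  # None: all lagged
print(pick_reader_index({1: 30.0, 2: 2.0, 3: 12.0}))   # 2: only usable one
```

The point is that the error fires on the *measured* lag, so either the replicas really are behind or the lag measurement itself is wrong.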
The errors are only happening for enwiki, and I have not been able to pinpoint what is going on: https://logstash.wikimedia.org/goto/61e27211c2a393503bc0032492b64693
However, they do not correlate with any increase of fatals: https://logstash.wikimedia.org/goto/ce401250163c21e77e591bbc6f5d5779
There is also nothing indicating that we have a misbehaving enwiki replica, or that this is causing any user impact:
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All&from=now-24h&to=now
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-24h&to=now
The only event that matches 07:10 AM UTC exactly is that I stopped replication on the s7 codfw master (db2118) for a schema change, but I would guess that is unrelated, as enwiki does not live on s7 (centralauth does, though).
However, that host is now replicating normally and the errors have not stopped.
At 07:07 AM UTC I added a small amount of weight to a vslow,dump replica on enwiki:
07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12139 and previous config saved to /var/cache/conftool/dbconfig/20200803-070702-marostegui.json
However, depooling that host entirely has not solved the issue, so it is probably unrelated; that host also does not show any trace of problems.
Any idea what can be going on here?