
Lots of Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
Closed, Resolved · Public

Description

Since 07:10 AM UTC we've had around 1k errors per minute like:

Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode

This is the trace:

@timestamp: 2020-08-03T14:22:54
@version: 1
_id: AXO0tDBqMQ_08tQaoG--
_index: logstash-mediawiki-2020.08.03
_score: -
_type: mediawiki
channel: DBReplication
facility: user
host: mw2139
http_method: GET
ip: 10.192.16.112
level: ERROR
logsource: mw2139
message: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
mwversion: 1.36.0-wmf.2
normalized_message: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
phpversion: 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1
program: mediawiki
referrer: -
reqId: 040e7bae-93fc-42fd-a980-57ce6e80d111
server: en.wikipedia.org
servergroup: api_appserver
severity: err
shard: s1
tags: input-kafka-rsyslog-udp-localhost, rsyslog-udp-localhost, kafka, es, es
timestamp: 2020-08-03T14:22:54+00:00
type: mediawiki
url: /w/api.php
wiki: enwiki
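
For context on where this message comes from: when the rdbms LoadBalancer picks a replica for a read query, it skips replicas whose measured lag is above the configured maximum; if every replica is over the limit, it logs this error and switches the handle into lagged (read-only) mode. The snippet below is only a simplified sketch of that logic with made-up server names and an illustrative threshold, not the actual Wikimedia\Rdbms\LoadBalancer code.

```php
<?php
// Simplified sketch (not the real LoadBalancer implementation): pick a reader
// among replicas whose lag is under a threshold; if none qualify, fall back to
// the least-lagged one and flag "lagged replica" (read-only) mode.

/**
 * @param array $lagByServer replica name => lag in seconds (false = unreachable)
 * @param float $maxLag lag threshold in seconds (illustrative value)
 * @return array [ 'server' => ?string, 'laggedReplicaMode' => bool ]
 */
function pickReaderIndexSketch( array $lagByServer, $maxLag = 6.0 ) {
    $candidates = [];
    foreach ( $lagByServer as $server => $lag ) {
        if ( $lag !== false && $lag <= $maxLag ) {
            $candidates[$server] = $lag;
        }
    }

    if ( $candidates !== [] ) {
        // Normal case: choose among the non-lagged replicas (MediaWiki does a
        // weighted pick; taking the least-lagged one here for brevity).
        asort( $candidates );
        return [ 'server' => array_keys( $candidates )[0], 'laggedReplicaMode' => false ];
    }

    // All replicas are lagged or unreachable: this is the situation that produces
    // the "all replica DBs lagged. Switch to read-only mode" log entry.
    $reachable = array_filter( $lagByServer, function ( $lag ) {
        return $lag !== false;
    } );
    asort( $reachable );

    return [
        'server' => $reachable !== [] ? array_keys( $reachable )[0] : null,
        'laggedReplicaMode' => true,
    ];
}

// Example: every replica reports lag above the threshold, so lagged/read-only mode kicks in.
var_dump( pickReaderIndexSketch( [ 'db-replica-a' => 245.0, 'db-replica-b' => 251.3 ] ) );
```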

They are only happening for enwiki, and I have not been able to figure out exactly what is going on: https://logstash.wikimedia.org/goto/61e27211c2a393503bc0032492b64693
However, they do not correlate with any increase of fatals: https://logstash.wikimedia.org/goto/ce401250163c21e77e591bbc6f5d5779

There is also nothing indicating that an enwiki replica is misbehaving or that this is causing any user impact:

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All&from=now-24h&to=now
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-24h&to=now

The only thing that matches perfectly with around 07:10 AM UTC is that I stopped replication on the s7 codfw master (db2118) for a schema change, but I guess that should not be related, as enwiki doesn't live on s7 (centralauth does, though).
However, that host is now replicating normally and the errors haven't cleared.

At 07:07 AM UTC I gave just a bit more weight to a vslow,dump replica on enwiki:

07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12139 and previous config saved to /var/cache/conftool/dbconfig/20200803-070702-marostegui.json

However, depooling that host entirely hasn't solved the issue, so it is probably not related; that host doesn't show any sign of problems either.

Any idea what can be going on here?

Event Timeline

For some reason this stopped at around 16:00: https://logstash.wikimedia.org/goto/0dc620ac3b3d1f390ec569a75a722917
I would still like to know what could have happened, though.

This is happening again, and it is definitely related to the fact that I have stopped replication on the s8 codfw master to do the MCR schema change deployment.
Is this expected? I don't recall a codfw master having had replication stopped while MW was doing these checks. I could be wrong, but I would like to know whether this is expected or something that was introduced recently.

However, the original report I made is a bit stranger, as it was s1 that was complaining while replication was stopped on s7.

@Marostegui Did this come up through an alert or monitoring query, or did you specifically look for DB-related logs? If it doesn't affect general monitoring dashboards in Kibana, perhaps it is fine as-is?

The message is from a codfw app server, and it does not appear to cause exceptions or fatal errors. If one were to attempt DB writes in codfw during this, those would cause a fatal error and show up in MW alerts; but then again, DB writes in codfw would fatal anyway given the "soft" read-only mode we have there already.
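
(For reference, a minimal sketch of what a MediaWiki-level "soft" read-only configuration can look like; this assumes it is the MediaWiki-side flag rather than MySQL's own read_only setting. $wgReadOnly is a real core setting, while the datacenter variable names below are hypothetical stand-ins, not the actual wmf-config:)

```php
<?php
// Hypothetical sketch, not the actual wmf-config: MediaWiki-level ("soft")
// read-only mode is enabled by giving $wgReadOnly a reason string, after which
// attempted writes fail with a ReadOnlyError before they reach the database.
// $wmfThisDatacenter / $wmfPrimaryDatacenter are made-up names for this sketch.
if ( $wmfThisDatacenter !== $wmfPrimaryDatacenter ) {
    $wgReadOnly = 'This datacenter is not the primary; MediaWiki is read-only here.';
}
```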

I assume that stopping replication from codfw master to local replicas also means that the heartbeat entries with datacenter=eqiad are not coming through, right? If so, then it seems expected that MW will perceive a lagged reality, because it is. This is configured in db-codfw.php where we decided that MW should report lag end-to-end and not merely relative to its local master.
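
(To illustrate that end-to-end measurement: pt-heartbeat keeps writing timestamp rows on the eqiad master, and a replica's lag is derived by comparing the newest heartbeat row for the primary datacenter with the current time, so stopping replication on the codfw master freezes those rows for every host replicating below it and the computed lag grows without bound. The sketch below is a rough approximation under assumed table and column names, i.e. heartbeat.heartbeat with ts and datacenter columns, not the actual rdbms implementation:)

```php
<?php
// Rough sketch of end-to-end lag measurement via pt-heartbeat rows.
// Assumptions (not verified against production): the table is
// heartbeat.heartbeat with `ts` and `datacenter` columns, and timezone
// handling is glossed over. If replication from the primary DC stops
// anywhere upstream, the newest matching row goes stale and the computed
// lag keeps growing until it exceeds every "max lag" threshold.
function getHeartbeatLagSeconds( PDO $replica, $primaryDc = 'eqiad' ) {
    $stmt = $replica->prepare(
        'SELECT ts FROM heartbeat.heartbeat WHERE datacenter = ? ORDER BY ts DESC LIMIT 1'
    );
    $stmt->execute( [ $primaryDc ] );
    $ts = $stmt->fetchColumn();
    if ( $ts === false ) {
        return null; // no heartbeat row for that datacenter at all
    }

    return max( 0.0, microtime( true ) - strtotime( $ts ) );
}
```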

Is the issue that 1k/min is considered too high? If so, too high for what? For Kibana load purposes? Seems fairly low. Nonetheless, I do wonder why there are that many MW requests in codfw, but perhaps that's all just from health check requests? Do we have 1k/min health checks per DC?

Marostegui lowered the priority of this task from High to Medium. (Aug 11 2020, 6:10 AM)

> @Marostegui Did this come up through an alert or monitoring query, or did you specifically look for DB-related logs? If it doesn't affect general monitoring dashboards in Kibana, perhaps it is fine as-is?

I saw it on this dashboard: https://logstash.wikimedia.org/goto/61e27211c2a393503bc0032492b64693 which we (DBAs) normally monitor.

> The message is from a codfw app server, and it does not appear to cause exceptions or fatal errors. If one were to attempt DB writes in codfw during this, those would cause a fatal error and show up in MW alerts; but then again, DB writes in codfw would fatal anyway given the "soft" read-only mode we have there already.

Yep, that is what I noted when I created the task: it doesn't seem to be user-facing. Since you are also seeing that, I am decreasing the priority of the task; I wanted someone else to confirm it as well :-)

> I assume that stopping replication from codfw master to local replicas also means that the heartbeat entries with datacenter=eqiad are not coming through, right? If so, then it seems expected that MW will perceive a lagged reality, because it is. This is configured in db-codfw.php where we decided that MW should report lag end-to-end and not merely relative to its local master.

Correct, eqiad's heartbeat won't go through until we restart replication on codfw's master.

> Is the issue that 1k/min is considered too high? If so, too high for what? For Kibana load purposes? Seems fairly low. Nonetheless, I do wonder why there are that many MW requests in codfw, but perhaps that's all just from health check requests? Do we have 1k/min health checks per DC?

1k/min would be considered high for codfw, I would say; eqiad is a different story.
What looks strange to me is that I stopped the centralauth (s7) master and the wiki that showed up as lagged was enwiki. On this last event, though, stopping wikidata's (s8) master only shows wikidata as lagged, which is what I would have expected to happen with s7 (where centralauth lives).
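
(For anyone following along: wikis map to sections in the load-balancer configuration, and per this task enwiki lives on s1, centralauth on s7 and wikidatawiki on s8, so a stopped s7 master should only ever surface as lag for s7 wikis. Below is a trimmed-down, hedged illustration of that kind of mapping, not the real db-codfw.php:)

```php
<?php
// Trimmed-down illustration (not the actual db-codfw.php) of how wikis map to
// database sections; unlisted wikis fall back to a default section. The
// assignments reflect what this task mentions: enwiki on s1, centralauth on
// s7, wikidatawiki on s8.
$sectionsByDB = [
    'enwiki'       => 's1',
    'centralauth'  => 's7',
    'wikidatawiki' => 's8',
    // ... all other wikis omitted for brevity
];

$wiki = 'enwiki';
$section = isset( $sectionsByDB[$wiki] ) ? $sectionsByDB[$wiki] : 'DEFAULT';
// Lag reported for $wiki should only depend on the replicas (and heartbeat) of $section.
```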

Lastly, if this is expected from the MW side, I am OK with closing this task.

Thank you!

> 1k/min would be considered high for codfw, I would say; eqiad is a different story.
> What looks strange to me is that I stopped the centralauth (s7) master and the wiki that showed up as lagged was enwiki. On this last event, though, stopping wikidata's (s8) master only shows wikidata as lagged, which is what I would have expected to happen with s7 (where centralauth lives).

By "showed up as lagged", do you mean that there were mosly logstash messages with wiki:enwiki or did you see this in some graph of MW statsd data or such? I could probably tweak the "all replica DBs lagged" entries to have some extra fields.

>> 1k/min would be considered high for codfw, I would say; eqiad is a different story.
>> What looks strange to me is that I stopped the centralauth (s7) master and the wiki that showed up as lagged was enwiki. On this last event, though, stopping wikidata's (s8) master only shows wikidata as lagged, which is what I would have expected to happen with s7 (where centralauth lives).
>
> By "showed up as lagged", do you mean that there were mostly logstash messages with wiki:enwiki, or did you see this in some graph of MW statsd data or such? I could probably tweak the "all replica DBs lagged" entries to have some extra fields.

It was logstash indeed

Change 627387 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: consistently resolve DB domains early in LoadBalancer

https://gerrit.wikimedia.org/r/627387

Change 627387 merged by jenkins-bot:
[mediawiki/core@master] rdbms: consistently resolve DB domains early in LoadBalancer

https://gerrit.wikimedia.org/r/627387