
Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
Closed, Resolved (Public)

Description

The problem: The beta cluster is continuously switching to read-only mode even after the upgrade work has finished.

Screen Shot 2018-11-29 at 10.45.12 AM.png (262×1 px, 67 KB)

Note the downward spike while the work was being carried out, but also how it continues long after, at a rate of 300 events every 3 hours (all BC wikis): https://logstash-beta.wmflabs.org/goto/3939eafc6573af921c4231766d81982f

Implications: The entire mobile site (and various other extensions) are without reliable browser test coverage.

More background

Over the last few weeks the beta cluster was in read only mode. I am told this is no longer the case, but we are consistently seeing failures in our browser tests stating "The wiki is currently in read-only mode. (readonly) (MediawikiApi::ApiError)".

The wiki is currently in read-only mode. (readonly) (MediawikiApi::ApiError)
/mnt/home/jenkins-deploy/.gem/2.1.0/gems/mediawiki_api-0.7.0/lib/mediawiki_api/client.rb:211:in `send_request'
/mnt/home/jenkins-deploy/.gem/2.1.0/gems/mediawiki_api-0.7.0/lib/mediawiki_api/client.rb:232:in `raw_action'
/mnt/home/jenkins-deploy/.gem/2.1.0/gems/mediawiki_api-0.7.0/lib/mediawiki_api/client.rb:36:in `action'
/mnt/home/jenkins-deploy/.gem/2.1.0/gems/mediawiki_api-0.7.0/lib/mediawiki_api/client.rb:111:in `edit'
/mnt/home/jenkins-deploy/.gem/2.1.0/gems/mediawiki_api-0.7.0/lib/mediawiki_api/client.rb:103:in `create_page'

This is high priority from the web side, as every day this remains broken we are without test coverage for a key part of the site, and we are out of our depth here.

Event Timeline

@zeljkofilipin and @Anomie, you seem like the right experts to ask about my two questions. Can you shed any light on what could be going wrong here?

Oddly enough, I ran the exact same tests locally and, aside from some known issues with toasts ("I should see a toast notification"), the tests all passed and I did not hit the read-only issue. Does Quibble have some kind of caching layer?

FWIW from a Beta-Cluster-Infrastructure perspective, editing appears to be working both in the UI and API. If you can find the HTTP request that actually gets sent to the beta cluster (preferably in the form of a curl command or similar) that triggers this, even inconsistently, I can take a look, but I don't know enough about your client setup to debug from here right now.


Not sure how exactly these tests run... but I suspect either Saucelabs or the driver that runs these tests has some kind of caching in place. I have not been able to replicate the issue against the beta cluster, either by running the tests from the command line or by manually executing the API queries. I suspect the browser testing infrastructure here is what's at fault... with some more debugging or time we should be able to get to the bottom of this.

Removing MediaWiki-Action-API, there doesn't seem to be anything here related to that tag.

If replica databases are lagged, MediaWiki will automatically switch to read-only mode; perhaps you're running into that. This particularly applies if your test user is flagged as a bot. You might check the logs on deployment-fluorine02.deployment-prep.eqiad.wmflabs or https://logstash-beta.wmflabs.org/ after a failing run to see if anything interesting was logged. You might also have your test framework log the 'readonlyreason' field from the API error response, which should give you a better idea of why exactly it's read-only.
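
For anyone who wants to capture that field by hand rather than through the test framework, a rough probe along these lines should work (the beta enwiki URL, the page title, and the anonymous edit token '+\' are my assumptions here; jq is only used to pretty-print the interesting fields):

# Attempt a trivial edit against the Beta Cluster API and print the error code,
# info, and readonlyreason (the latter should only appear when the wiki is read-only).
curl -s 'https://en.wikipedia.beta.wmflabs.org/w/api.php' \
  --data 'action=edit&format=json&title=Read-only_probe&appendtext=probe' \
  --data-urlencode 'token=+\' \
  | jq '{code: .error.code, info: .error.info, readonlyreason: .error.readonlyreason}'

If anonymous editing is disabled on that wiki you may get a permissions error instead, but during a read-only window the readonly error and its reason should show up.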

I noticed that the Selenium Sauce Labs jobs seem to pass when on saucelabs-03 and fail on saucelabs-01. Can we try running a couple selenium-MinervaNeue jobs on saucelabs-03?

Jdlrobson renamed this task from Ruby browser tests failing consistently with "The wiki is currently in read-only mode. (readonly) (MediawikiApi::ApiError)" to Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode. Nov 29 2018, 6:46 PM
Jdlrobson raised the priority of this task from High to Unbreak Now!.
Jdlrobson edited projects, added Beta-Cluster-Infrastructure, SRE; removed Quibble.
Jdlrobson updated the task description. (Show Details)

Thanks @Anomie for that insight; that's super helpful. After looking into logstash-beta today, it looks like this is exactly what's happening.

RelEng, there is clearly something wrong with the beta cluster post-migration. I've updated the description with the logstash URL. Given the impact this is having on testing (and the severe regressions we've already seen this week, e.g. T210670), I've bumped this to Unbreak Now!.

Over the last few weeks the beta cluster was in read only mode.

This isn't true, afaict. There was the migration, but that started and ended last week (with some smaller follow-up this week).

But let's not quibble (get it?) over that assertion: the Beta Cluster is indeed experiencing more read-only mode since the migration:

https://logstash-beta.wmflabs.org/goto/9075ea98170e327ef430c74cd55e83e0

That ^ is the same query @Jdlrobson added, but viewing the past 2 weeks: it shows nothing before the migration, the obvious bump while it was migrating, and a continued low level after.


Screen Shot 2018-11-29 at 10.57.24 AM.png (205×914 px, 30 KB)

Thanks @greg for that link (screenshot of graph below to save others a click)! Is this something RelEng can investigate or are there other teams that should be pulled in here?

I'm asking people in RelEng, but this may need more direct mediawiki db interaction experience than we have.

Can we please try to find out what requests are being sent that trigger it?

Looking at SHOW SLAVE STATUS\G on deployment-db04 shows no current lag according to Seconds_Behind_Master, but I'm looking into where we might see its historical values. I thought there was a prometheus collector set up for beta's MariaDB instances, but I'm not sure whether it monitors master/slave replication status.
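
If an exporter is running on the DB hosts, a quick check like this would answer whether replication status is exposed at all (I'm assuming the stock prometheus-mysqld-exporter on its default port 9104; metric names vary a bit between exporter versions):

# Run from any host that can reach the database instance; if the exporter is
# present and scraping replication status, this should print the slave_status
# metrics, including seconds behind master.
curl -s http://deployment-db04.deployment-prep.eqiad.wmflabs:9104/metrics \
  | grep -i 'slave_status'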

In the meantime, @Krenair has set up a cronjob that monitors Seconds_Behind_Master, so we'll have that data in case this happens again soon.
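
I don't know exactly what that cronjob looks like, but a minimal version of such a logger would be something along these lines (the paths, credentials file, and once-a-minute schedule are illustrative, not a description of the actual job):

#!/bin/bash
# log-replag.sh: append a UTC timestamp and the replication status fields we care about.
# Assumed cron entry: * * * * * /usr/local/sbin/log-replag.sh
{
  date -u
  mysql --defaults-file=/root/.my.cnf -e 'SHOW SLAVE STATUS\G' \
    | grep -E 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running'
} >> /var/log/replag.log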

I'm also wondering how MediaWiki detects replication lag. Does it look at all DB hosts? And which MariaDB variables does it inspect? Is it possible that it's looking at deployment-db03 and erroneously detecting it as a lagged replica?

Looking at SHOW SLAVE STATUS, deployment-db03 does indicate that it's half configured as a replica, just not replicating, due to it pointing to itself as master and an IO failure.

root@BETA[(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: deployment-db03
                  Master_User:
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: deployment-db03-relay-bin.000005
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 248
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1593
                Last_IO_Error: Fatal error: Invalid (empty) username when attempting to connect to the master server. Connection attempt te
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:

Sorry, I posted the wrong server's output. :) Editing...

Looking at @Krenair's log of replication lag, there's no indication of lag at any point since the first entry ("Thu Nov 29 20:13:01 UTC 2018").

However, since that runs from cron (every minute), it's very possible that lag is spiking intermittently between samples and then subsiding. To rule out that possibility, I've started a process on deployment-deploy01 that runs mwscript getLagTimes.php --wiki=enwiki --report every 5 seconds. The --report option seems to tell it to send the lag statistics to statsd, so it should be fairly easy to see whether there is indeed replication lag during the same time periods in which these errors occur.
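
For reference, the sampling loop is roughly this (a sketch of the approach rather than necessarily the exact command line used):

# Sample replication lag every 5 seconds; --report pushes the values to statsd.
while true; do
  mwscript getLagTimes.php --wiki=enwiki --report
  sleep 5
done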

I've verified the gauge values are making it to Labs Grafana as BetaMediaWiki.loadbalancer.lag.external.172.16.5.5.3306 and BetaMediaWiki.loadbalancer.lag.external.172.16.5.23.3306 for the replica and master, respectively. (I can't think of a reason why the master should ever indicate lag, but we have the value in any case.) There's also a simple dashboard.
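
A quick way to double-check that the gauges are actually leaving the host is to watch the outgoing statsd traffic on deployment-deploy01 (this assumes statsd's default UDP port 8125; beta's statsd may listen elsewhere):

# Print outgoing statsd packets and filter for the lag gauges.
sudo tcpdump -A -n -l udp port 8125 | grep --line-buffered 'loadbalancer.lag'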

dduvall added a subscriber: thcipriani.

AFAICT, this and related errors are no longer occurring. It's not yet clear exactly what the underlying issue might have been, but in troubleshooting with @thcipriani yesterday we noticed a few things that might have had an impact. Since @thcipriani graciously jumped in to take care of the actual fixes, he might want/need to correct some of this information. :)

  • deployment-mwmaint01 was missing from the scap targets on deployment-deploy01 and thus had outdated configuration: the database loads were still configured with the old, pre-migration IP addresses of deployment-db03/-db04. This seemed responsible for the database connection errors we were seeing in the logs while troubleshooting the read-only issue.
  • MW cache servers were misconfigured on some deployment-prep instances, due in part to the above missing scap target but possibly also related to the partial deployment of T210030: RedisBagOStuff is broken on beta. I'm not sure whether caching was completely inoperable or only semi-working, but since the database LoadBalancer and LoadMonitor both rely on cached values (master binlog position and lag values) when waiting for replication to catch up, it seems possible that a misconfiguration of the centralized cache could result in MediaWiki miscalculating replication lag. That would at least explain why MediaWiki was reporting wildly different lag values than were recorded by the logging attempts mentioned previously in this task.
  • We weren't able to verify what configuration, if any, for the beta cluster resides in etcd, nor whether it is correct.

Marking this resolved for now. Note that I'm still seeing errors occurring that seem related to missing redis servers ("Failed connecting to redis server at deployment-redis05.deployment-prep.eqiad.wmflabs") but they seem related to the redis_lock configuration and should have additional tasks filed.

Note that I'm still seeing errors occurring that seem related to missing redis servers ("Failed connecting to redis server at deployment-redis05.deployment-prep.eqiad.wmflabs") but they seem related to the redis_lock configuration and should have additional tasks filed.

(T210301: Status of deployment-redis0[56]?)