Page MenuHomePhabricator

Spikes of mediawiki in read only for job runners after altering the s2 slaves topology
Closed, ResolvedPublic

Description

For failover, all slaves of db1024 has been put under db1018. Immediately after that, but before the failover was done, and after it, there has been an increase of mediawiki "read only" errors for mediawiki rpc scalers. There seems that mediawiki detects as if its master database is in read only, or there is lag, but I cannot see any of both things. Could there be any caching in place that makes jobrunners thing that db1024 is still the master? Could there be lag that is not detected by my monitoring? Has MariaDB 10 made the check fail?

Example trace:

{
  "_index": "logstash-2016.02.10",
  "_type": "mediawiki",
  "_id": "AVLKuL4xlAIL90ZzRaZm",
  "_score": null,
  "_source": {
    "message": "Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.",
    "@version": 1,
    "@timestamp": "2016-02-10T10:27:30.000Z",
    "type": "mediawiki",
    "host": "mw1015",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "exception-json"
    ],
    "channel": "exception",
    "normalized_message": "{\"id\":\"2dbad093\",\"type\":\"DBReadOnlyError\",\"file\":\"/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php\",\"line\":789,\"message\":\"Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.\",",
    "url": "/rpc/RunJobs.php?wiki=itwiki&type=refreshLinks&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "9884b15",
    "process_id": 1061,
    "wiki": "itwiki",
    "mwversion": "1.27.0-wmf.12",
    "private": true,
    "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php",
    "line": 789,
    "code": 0,
    "backtrace": [
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php",
        "line": 1505,
        "function": "query",
        "class": "DatabaseBase",
        "type": "->",
        "args": [
          "string",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php",
        "line": 39,
        "function": "update",
        "class": "DatabaseBase",
        "type": "->",
        "args": [
          "string",
          "array",
          "array",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php",
        "line": 280,
        "function": "__call",
        "class": "DBConnRef",
        "type": "->",
        "args": [
          "string",
          "array"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 964,
        "function": "update",
        "class": "DBConnRef",
        "type": "->",
        "args": [
          "string",
          "array",
          "array",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 217,
        "function": "updateLinksTimestamp",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 144,
        "function": "doIncrementalUpdate",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/DataUpdate.php",
        "line": 99,
        "function": "doUpdate",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php",
        "line": 253,
        "function": "runUpdates",
        "class": "DataUpdate",
        "type": "::",
        "args": [
          "array"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php",
        "line": 114,
        "function": "runForTitle",
        "class": "RefreshLinksJob",
        "type": "->",
        "args": [
          "Title"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php",
        "line": 262,
        "function": "run",
        "class": "RefreshLinksJob",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php",
        "line": 176,
        "function": "executeJob",
        "class": "JobRunner",
        "type": "->",
        "args": [
          "RefreshLinksJob",
          "BufferingStatsdDataFactory",
          "integer"
        ]
      },
      {
        "file": "/srv/mediawiki/rpc/RunJobs.php",
        "line": 47,
        "function": "run",
        "class": "JobRunner",
        "type": "->",
        "args": [
          "array"
        ]
      }
    ],
    "exception_id": "2dbad093",
    "class": "mediawiki",
    "message_checksum": "574ca05b75c07c3e0dc56dfb40ea20ba"
  },
  "sort": [
    1455100050000
  ]
}

Jobrunners and jobchroners were restarted yesterday after failover with:

sudo salt -G 'cluster:jobrunner' cmd.run 'service jobrunner status | grep running && service jobrunner restart'
sudo salt -G 'cluster:jobrunner' cmd.run 'service jobchron status | grep running && service jobchron restart'

Details

Related Objects

StatusAssignedTask
Resolvedhashar
Resolvedhashar
ResolvedJoe
ResolvedLegoktm
Declineddemon
ResolvedJoe
ResolvedReedy
ResolvedNone
OpenNone
OpenNone
Resolvedtomasz
Resolvedkaldari
Resolvedkaldari
OpenNone
DeclinedNone
ResolvedNone
ResolvedJoe
ResolvedNone
ResolvedJoe
ResolvedNone
ResolvedJoe
Resolvedkaldari
Resolvedjcrespo
ResolvedVolans
Resolvedaaron
InvalidNone

Event Timeline

jcrespo created this task.Feb 10 2016, 10:39 AM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added subscribers: jcrespo, aaron.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptFeb 10 2016, 10:39 AM
jcrespo renamed this task from Spikes of mediawiki in read only for job runners after altering the s2 slaves toplogy to Spikes of mediawiki in read only for job runners after altering the s2 slaves topology.Feb 10 2016, 10:39 AM
jcrespo set Security to None.

@aaron I specifically need your help to discard this is not related to work done related to lag detection and load-balancing.

00:16 <   Krenair> oh, enwikt and itwiki are read-only?
00:17 <   Krenair> not obvious why from tendril but exception.log is showing lots of job runners complaining about it
00:17 <    greg-g> ... still? (they were for the master db switch over)
00:18 <   Krenair> the entries coming in are timestamped 2016-02-11 00:17
00:18 <   Krenair> so yeah

So the current thesis is that *very* long running jobs on job runner mediawikis do not reload its configuration/code immediately, continuing running for days with old versions. I condirmed this by restarting HHVM for mw1015 and that fixed it for it since 12:50 https://logstash.wikimedia.org/#dashboard/temp/AVLPjd2UptxhN1Xab-4J.

@greg This impacts deployments at your team. You may want to either be aware of it or change something (procedures, scap code), as this will impact all deployments that affects the queue processing. This may be one cause why you are seeing requests of old mediawiki versions- not only higher-level caching.

My long-term solution (Re:Databases) will be to migrate away from deploying mediawiki, as I cannot handle *days* with config not taking place.

Restarting HHVM didn't fix it, either there is something else cached wrongly (master host on memcached?) or something on the code identifying lag is failing.

demon added a subscriber: demon.Feb 11 2016, 5:52 PM

So the current thesis is that *very* long running jobs on job runner mediawikis do not reload its configuration/code immediately, continuing running for days with old versions. I condirmed this by restarting HHVM for mw1015 and that fixed it for it since 12:50 https://logstash.wikimedia.org/#dashboard/temp/AVLPjd2UptxhN1Xab-4J.
@greg This impacts deployments at your team. You may want to either be aware of it or change something (procedures, scap code), as this will impact all deployments that affects the queue processing. This may be one cause why you are seeing requests of old mediawiki versions- not only higher-level caching.

So generally speaking we should probably restart job runners to pick up config changes. This was brought up the other day in the codfw meeting and again here. It'd just be good practice. I've filed T126632: Scap should restart job runners to pick up new config for that.

My long-term solution (Re:Databases) will be to migrate away from deploying mediawiki, as I cannot handle *days* with config not taking place.

+1. I think having a way to source DB config from something like etcd would be a worthwhile endeavor. And not too terribly difficult I'd imagine...

Restarting HHVM didn't fix it, either there is something else cached wrongly (master host on memcached?) or something on the code identifying lag is failing.

This is a much deeper problem if components of MediaWiki are storing configuration in a way that isn't easily expired when that config changes.

jcrespo added a comment.EditedFeb 12 2016, 7:49 PM

Some facts:

  • This is now only happening on enwiktionary, not on the other wikis. It happened on other wikis (like itwiki, at least), during the scheduled time (expected), and until 2/11 6:51:58 UTC
  • enwiktionary is not the largest wiki on s2 (pt, pl, nl, it and zh are larger)
  • This is only happening for the job "RefreshLinks"
  • Something similar happened with htmlCacheUpdate with "Could not wait for slaves to catch up to db1024" because mediawiki expects its production slaves to be direct children of the master. That should not be the case. In any case, that does not happen now.
  • There are no differences, that I could see, in grants between db1018 and db1024 (e.g. potentially affecting lag calculation)
  • Job runner are not trying to connect to the old master- I brought db1024 down and there was no complain.

I need to check the code of RefreshLinks to try to debug the error.

This is another view from the logs:

2016-02-16 08:02:07 mw1007 enwiktionary 1.27.0-wmf.13 runJobs ERROR: refreshLinks Module:languages pages=array(1)
rootJobSignature=6fea3b2d617ac2546a0f3e179411515f1e2cbcd3 rootJobTimestamp=20160214234708 masterPos=db1018-bin.001003/143625119 triggeredRecursive=1
(uuid=c0471b4d90a14ae98ffaf63d62bbc0fd,timestamp=1455609590,QueuePartition=rdb3-6379) t=40 error=DBReadOnlyError: Database is read-only: The database has been
automatically locked while the slave database servers catch up to the master.

I found the original cause. The job is waiting for a master log pos that will never arrive (db1024-bin.*), as it points to a different master now (db1018-bin.*), which happens to be lower than the current one:

{
  "_index": "logstash-2016.02.16",
  "_type": "mediawiki",
  "_id": "AVLpe29zptxhN1XaOKM4",
  "_score": null,
  "_source": {
    "message": "LoadBalancer::doWait: Timed out waiting on db1067 pos db1024-bin.002071/824828094:\nLoadBalancer.php line 501 calls wfBacktrace()\nLoadBalancer.php line 377 calls LoadBalancer->doWait()\nRefreshLinksJob.php line 130 calls LoadBalancer->waitFor()\nRefreshLinksJob.php line 111 calls RefreshLinksJob->waitForMasterPosition()\nJobRunner.php line 262 calls RefreshLinksJob->run()\nJobRunner.php line 176 calls JobRunner->executeJob()\nRunJobs.php line 47 calls JobRunner->run()",
    "@version": 1,
    "@timestamp": "2016-02-16T09:48:46.000Z",
    "type": "mediawiki",
    "host": "mw1166",
    "level": "INFO",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "DBPerformance",
    "normalized_message": "LoadBalancer::doWait: Timed out waiting on db1067 pos db1024-bin.002071/824828094:\nLoadBalancer.php line 501 calls wfBacktrace()\nLoadBalancer.php line 377 calls LoadBalancer->doWait()\nRefreshLinksJob.php line 130 calls LoadBalancer->waitFor()\nRefreshLinks",
    "url": "/rpc/RunJobs.php?wiki=enwiktionary&type=refreshLinks&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "wikimedia.org",
    "referrer": null,
    "uid": "02c3422",
    "process_id": 25352,
    "wiki": "enwiktionary",
    "mwversion": "1.27.0-wmf.13",
    "private": false
  },
  "sort": [
    1455616126000
  ]
}

This will happen every time the master is changed, including the codfw failover.

Restricted Application added a project: codfw-rollout. · View Herald TranscriptFeb 16 2016, 10:20 AM

Change 270926 had a related patch set uploaded (by Jcrespo):
Fix waiting for a binlog position when the binlog name has changed

https://gerrit.wikimedia.org/r/270926

@demon, the job runner restart may be still needed, but it wasn't the main cause of problems in this case.

jcrespo triaged this task as High priority.Feb 16 2016, 1:10 PM

This is a mediawiki-core defect that breaks master failovers.

aaron claimed this task.Feb 17 2016, 7:38 PM

Change 271427 had a related patch set uploaded (by Aaron Schulz):
Fixes to masterPosWait()

https://gerrit.wikimedia.org/r/271427

This particular instance of errors ended on 2016-02-20T13:54:32.000Z. I do not know if someone did something, but it seems redis run out of will to retry. The general problem still persist, until Aaron's patch is applied.

Change 270926 abandoned by Jcrespo:
Fix waiting for a binlog position when the binlog name has changed

Reason:
abandon in favor of 271427

https://gerrit.wikimedia.org/r/270926

Change 271427 merged by jenkins-bot:
Fixes to masterPosWait() for master switchovers

https://gerrit.wikimedia.org/r/271427

Should this be closed, or should we wait for some testing/confirmation of issues resolved? Has this issue or its fix discarded as nothing to do with T129517 ?

aaron closed this task as Resolved.Mar 16 2016, 4:28 PM

I'll close it. Nothing else to do here after the patch.

T135690 may be a duplicate of this, at least the symptom and the culprit seem related.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM