Spikes of mediawiki in read only for job runners after altering the s2 slaves topology
Closed, Resolved · Public · PRODUCTION ERROR

Description

For the failover, all slaves of db1024 have been put under db1018. Immediately after that, both before and after the failover itself was done, there has been an increase in MediaWiki "read only" errors on the MediaWiki RPC scalers. It seems that MediaWiki behaves as if its master database were read-only, or as if there were lag, but I cannot see either of those things. Could there be any caching in place that makes the job runners think that db1024 is still the master? Could there be lag that is not detected by my monitoring? Has MariaDB 10 made the check fail?

Example trace:

{
  "_index": "logstash-2016.02.10",
  "_type": "mediawiki",
  "_id": "AVLKuL4xlAIL90ZzRaZm",
  "_score": null,
  "_source": {
    "message": "Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.",
    "@version": 1,
    "@timestamp": "2016-02-10T10:27:30.000Z",
    "type": "mediawiki",
    "host": "mw1015",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "exception-json"
    ],
    "channel": "exception",
    "normalized_message": "{\"id\":\"2dbad093\",\"type\":\"DBReadOnlyError\",\"file\":\"/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php\",\"line\":789,\"message\":\"Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.\",",
    "url": "/rpc/RunJobs.php?wiki=itwiki&type=refreshLinks&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "9884b15",
    "process_id": 1061,
    "wiki": "itwiki",
    "mwversion": "1.27.0-wmf.12",
    "private": true,
    "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php",
    "line": 789,
    "code": 0,
    "backtrace": [
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php",
        "line": 1505,
        "function": "query",
        "class": "DatabaseBase",
        "type": "->",
        "args": [
          "string",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php",
        "line": 39,
        "function": "update",
        "class": "DatabaseBase",
        "type": "->",
        "args": [
          "string",
          "array",
          "array",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php",
        "line": 280,
        "function": "__call",
        "class": "DBConnRef",
        "type": "->",
        "args": [
          "string",
          "array"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 964,
        "function": "update",
        "class": "DBConnRef",
        "type": "->",
        "args": [
          "string",
          "array",
          "array",
          "string"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 217,
        "function": "updateLinksTimestamp",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php",
        "line": 144,
        "function": "doIncrementalUpdate",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/DataUpdate.php",
        "line": 99,
        "function": "doUpdate",
        "class": "LinksUpdate",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php",
        "line": 253,
        "function": "runUpdates",
        "class": "DataUpdate",
        "type": "::",
        "args": [
          "array"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php",
        "line": 114,
        "function": "runForTitle",
        "class": "RefreshLinksJob",
        "type": "->",
        "args": [
          "Title"
        ]
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php",
        "line": 262,
        "function": "run",
        "class": "RefreshLinksJob",
        "type": "->",
        "args": []
      },
      {
        "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php",
        "line": 176,
        "function": "executeJob",
        "class": "JobRunner",
        "type": "->",
        "args": [
          "RefreshLinksJob",
          "BufferingStatsdDataFactory",
          "integer"
        ]
      },
      {
        "file": "/srv/mediawiki/rpc/RunJobs.php",
        "line": 47,
        "function": "run",
        "class": "JobRunner",
        "type": "->",
        "args": [
          "array"
        ]
      }
    ],
    "exception_id": "2dbad093",
    "class": "mediawiki",
    "message_checksum": "574ca05b75c07c3e0dc56dfb40ea20ba"
  },
  "sort": [
    1455100050000
  ]
}

The jobrunner and jobchron services were restarted yesterday, after the failover, with:

sudo salt -G 'cluster:jobrunner' cmd.run 'service jobrunner status | grep running && service jobrunner restart'
sudo salt -G 'cluster:jobrunner' cmd.run 'service jobchron status | grep running && service jobchron restart'

Details

	Subject	Repo	Branch	Lines +/-
	Fixes to masterPosWait() for master switchovers	mediawiki/core	master	+116 -33
	Fix waiting for a binlog position when the binlog name has changed	mediawiki/core	master	+56 -1

Related Objects

Status	Subtype	Assigned	Task
Resolved		hashar	T119138 [keyresult] Migrate majority of CI jobs to Nodepool (part 2)
Resolved		hashar	T119139 [keyresult] Migrate php (Zend and HHVM) CI jobs to Nodepool
Resolved		Joe	T125821 Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia
Resolved		Legoktm	T75901 Drop PHP 5.3 support
Declined		• demon	T91590 [Spike] Try out hack (<?hh) for mediawiki-config
Resolved		Joe	T104147 can we get rid of rsvg security patch?
Resolved		Reedy	T94149 Get rid of Zend 5.5 tests for wmf branches
Resolved		None	T86081 Complete the use of HHVM over Zend PHP on the Wikimedia cluster
Open		None	T32996 Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open	Feature	None	T47443 Deploy language-specific "uca-xx" collations on Wikimedia wikis
Resolved		tomasz	T90689 Set $wgCategoryCollation to 'uca-hsb' on Upper Sorbian Wikipedia (hsb.wp) and rebuild category sort keys
Resolved		kaldari	T128483 Fix category headers for pages that begin with numbers
Resolved		kaldari	T8948 Natural number sorting in category listings
Declined		None	T143669 Sort Umlauts correctly (#17)
Declined		None	T128806 Switch German Wikipedia to uca-de category collation
Resolved		None	T88088 Incorrect sorting in categories on Russian-language projects
Resolved		Joe	T129411 Run `php maintenance/updateCollation.php --force` on all Russian-language projects using uca-ru collation
Resolved		None	T131748 Refresh the appservers puppet code/configs
Resolved		Joe	T131749 Make all role::mediawiki::* classes compatible with debian jessie
Resolved		None	T136281 Broken sorting and multi-page categories for Cyrillic wikis
Resolved		Joe	T86096 Switch HAT appservers to trusty's ICU (or newer)
Resolved		kaldari	T58041 updateCollation.php script prohibitively slow for very large wikis
Resolved		• jcrespo	T130692 Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases
Resolved		Volans	T128353 Switchover to new s3 master
Resolved	PRODUCTION ERROR	aaron	T126436 Spikes of mediawiki in read only for job runners after altering the s2 slaves topology
Invalid		None	T126632 Scap should restart job runners to pick up new config

Event Timeline

• jcrespo created this task. Feb 10 2016, 10:39 AM

• jcrespo set the priority of this task to Needs Triage.

• jcrespo updated the task description. (Show Details)

• jcrespo added projects: MediaWiki-libs-Rdbms, WMF-JobQueue.

• jcrespo added subscribers: • jcrespo, aaron.

Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 10 2016, 10:39 AM

@aaron I specifically need your help to rule out that this is related to the recent work on lag detection and load balancing.

• jcrespo mentioned this in T125215: Prepare db1018 and s2-slaves for s2 master failover. Feb 10 2016, 2:17 PM

00:16 <   Krenair> oh, enwikt and itwiki are read-only?
00:17 <   Krenair> not obvious why from tendril but exception.log is showing lots of job runners complaining about it
00:17 <    greg-g> ... still? (they were for the master db switch over)
00:18 <   Krenair> the entries coming in are timestamped 2016-02-11 00:17
00:18 <   Krenair> so yeah

So the current thesis is that *very* long-running jobs on the job runner MediaWiki hosts do not reload their configuration/code immediately, and continue running for days with old versions. I confirmed this by restarting HHVM on mw1015, which has fixed it there since 12:50: https://logstash.wikimedia.org/#dashboard/temp/AVLPjd2UptxhN1Xab-4J

@greg This impacts deployments by your team. You may want to either be aware of it or change something (procedures, scap code), as this will impact all deployments that affect queue processing. This may be one reason why you are seeing requests served by old MediaWiki versions, in addition to higher-level caching.

My long-term solution (re: databases) will be to migrate away from deploying MediaWiki, as I cannot handle *days* of config changes not taking effect.

Restarting HHVM didn't fix it; either something else is cached wrongly (the master host in memcached?) or something in the code that identifies lag is failing.

In T126436#2018002, @jcrespo wrote:

So the current thesis is that *very* long-running jobs on the job runner MediaWiki hosts do not reload their configuration/code immediately, and continue running for days with old versions. I confirmed this by restarting HHVM on mw1015, which has fixed it there since 12:50: https://logstash.wikimedia.org/#dashboard/temp/AVLPjd2UptxhN1Xab-4J

@greg This impacts deployments by your team. You may want to either be aware of it or change something (procedures, scap code), as this will impact all deployments that affect queue processing. This may be one reason why you are seeing requests served by old MediaWiki versions, in addition to higher-level caching.

So generally speaking we should probably restart job runners to pick up config changes. This was brought up the other day in the codfw meeting and again here. It'd just be good practice. I've filed T126632: Scap should restart job runners to pick up new config for that.

My long-term solution (re: databases) will be to migrate away from deploying MediaWiki, as I cannot handle *days* of config changes not taking effect.

+1. I think having a way to source DB config from something like etcd would be a worthwhile endeavor. And not too terribly difficult I'd imagine...

In T126436#2018319, @jcrespo wrote:

Restarting HHVM didn't fix it; either something else is cached wrongly (the master host in memcached?) or something in the code that identifies lag is failing.

This is a much deeper problem if components of MediaWiki are storing configuration in a way that isn't easily expired when that config changes.

Some facts:

This is now only happening on enwiktionary, not on the other wikis. It happened on other wikis (itwiki, at least) during the scheduled time (expected), and until 2016-02-11 06:51:58 UTC.
enwiktionary is not the largest wiki on s2 (pt, pl, nl, it and zh are larger).
This is only happening for the "RefreshLinks" job.
Something similar happened with htmlCacheUpdate ("Could not wait for slaves to catch up to db1024") because MediaWiki expects its production slaves to be direct children of the master. That should not be the case. In any case, that no longer happens.
There are no differences, that I could see, in grants between db1018 and db1024 (e.g. ones potentially affecting lag calculation).
Job runners are not trying to connect to the old master: I brought db1024 down and there were no complaints.

I need to check the code of RefreshLinks to try to debug the error.

This is another view from the logs:

2016-02-16 08:02:07 mw1007 enwiktionary 1.27.0-wmf.13 runJobs ERROR: refreshLinks Module:languages pages=array(1)
rootJobSignature=6fea3b2d617ac2546a0f3e179411515f1e2cbcd3 rootJobTimestamp=20160214234708 masterPos=db1018-bin.001003/143625119 triggeredRecursive=1
(uuid=c0471b4d90a14ae98ffaf63d62bbc0fd,timestamp=1455609590,QueuePartition=rdb3-6379) t=40 error=DBReadOnlyError: Database is read-only: The database has been
automatically locked while the slave database servers catch up to the master.

I found the original cause. The job is waiting for a master binlog position that will never arrive (db1024-bin.*), because the slaves now replicate from a different master (db1018-bin.*), whose binlog numbering happens to be lower than the position being waited for:

{
  "_index": "logstash-2016.02.16",
  "_type": "mediawiki",
  "_id": "AVLpe29zptxhN1XaOKM4",
  "_score": null,
  "_source": {
    "message": "LoadBalancer::doWait: Timed out waiting on db1067 pos db1024-bin.002071/824828094:\nLoadBalancer.php line 501 calls wfBacktrace()\nLoadBalancer.php line 377 calls LoadBalancer->doWait()\nRefreshLinksJob.php line 130 calls LoadBalancer->waitFor()\nRefreshLinksJob.php line 111 calls RefreshLinksJob->waitForMasterPosition()\nJobRunner.php line 262 calls RefreshLinksJob->run()\nJobRunner.php line 176 calls JobRunner->executeJob()\nRunJobs.php line 47 calls JobRunner->run()",
    "@version": 1,
    "@timestamp": "2016-02-16T09:48:46.000Z",
    "type": "mediawiki",
    "host": "mw1166",
    "level": "INFO",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "DBPerformance",
    "normalized_message": "LoadBalancer::doWait: Timed out waiting on db1067 pos db1024-bin.002071/824828094:\nLoadBalancer.php line 501 calls wfBacktrace()\nLoadBalancer.php line 377 calls LoadBalancer->doWait()\nRefreshLinksJob.php line 130 calls LoadBalancer->waitFor()\nRefreshLinks",
    "url": "/rpc/RunJobs.php?wiki=enwiktionary&type=refreshLinks&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "wikimedia.org",
    "referrer": null,
    "uid": "02c3422",
    "process_id": 25352,
    "wiki": "enwiktionary",
    "mwversion": "1.27.0-wmf.13",
    "private": false
  },
  "sort": [
    1455616126000
  ]
}

This will happen every time the master is changed, including the codfw failover.
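
To illustrate the failure mode, here is a minimal sketch in Python (not MediaWiki's PHP, and not the patch that was eventually merged). It assumes positions are written as `<base>.<index>/<offset>`, as in the log entries above. The names `BinlogPos`, `parse_pos`, `has_reached` and `should_wait` are hypothetical, and so is the policy of skipping the wait when the binlog base name has changed.

```python
import re
from typing import NamedTuple, Optional


class BinlogPos(NamedTuple):
    base: str    # binlog base name, e.g. "db1024-bin" (tied to one master)
    index: int   # binlog file sequence number, e.g. 2071
    offset: int  # byte offset within that binlog file


_POS_RE = re.compile(r"^(?P<base>.+)\.(?P<index>\d+)/(?P<offset>\d+)$")


def parse_pos(text: str) -> BinlogPos:
    """Parse a position such as 'db1024-bin.002071/824828094'."""
    m = _POS_RE.match(text)
    if m is None:
        raise ValueError(f"not a binlog position: {text!r}")
    return BinlogPos(m["base"], int(m["index"]), int(m["offset"]))


def has_reached(current: BinlogPos, target: BinlogPos) -> Optional[bool]:
    """True/False when both positions come from the same binlog series,
    None when the base names differ and the positions are not comparable."""
    if current.base != target.base:
        return None
    return (current.index, current.offset) >= (target.index, target.offset)


def should_wait(current_text: str, target_text: str) -> bool:
    """Hypothetical policy: only wait when the target is comparable and
    has not been reached yet. A target recorded against a previous master
    is treated as unreachable, so waiting for it would only time out."""
    reached = has_reached(parse_pos(current_text), parse_pos(target_text))
    return reached is False
```

With the values from the logs above, `should_wait("db1018-bin.001003/143625119", "db1024-bin.002071/824828094")` returns `False`, because the base names differ. A naive numeric comparison would instead keep waiting, since 001003 is lower than 002071.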

Restricted Application added a project: codfw-rollout. Feb 16 2016, 10:20 AM

Change 270926 had a related patch set uploaded (by Jcrespo):
Fix waiting for a binlog position when the binlog name has changed

https://gerrit.wikimedia.org/r/270926

gerritbot added a project: Patch-For-Review. Feb 16 2016, 11:16 AM

@demon, the job runner restart may still be needed, but it wasn't the main cause of the problems in this case.

This is a mediawiki-core defect that breaks master failovers.

Joe moved this task from Backlog to In Progress on the codfw-rollout-Jan-Mar-2016 board. Feb 16 2016, 4:31 PM

aaron claimed this task. Feb 17 2016, 7:38 PM

Change 271427 had a related patch set uploaded (by Aaron Schulz):
Fixes to masterPosWait()

https://gerrit.wikimedia.org/r/271427

This particular instance of errors ended on 2016-02-20T13:54:32.000Z. I do not know if someone did something, but it seems Redis ran out of will to retry. The general problem still persists until Aaron's patch is applied.

• jcrespo added a parent task: T128353: Switchover to new s3 master. Feb 29 2016, 7:49 PM

Change 270926 abandoned by Jcrespo:
Fix waiting for a binlog position when the binlog name has changed

Reason:
abandon in favor of 271427

https://gerrit.wikimedia.org/r/270926

Change 271427 merged by jenkins-bot:
Fixes to masterPosWait() for master switchovers

https://gerrit.wikimedia.org/r/271427

ReleaseTaggerBot added projects: MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-03-08_(1.27.0-wmf.16)). Mar 8 2016, 1:00 PM

• demon moved this task from Untriaged to Dec2019/1.35.wmf.10+ on the Wikimedia-production-error board. Mar 12 2016, 12:00 AM

Should this be closed, or should we wait for some testing or confirmation that the issues are resolved? Has this issue, or its fix, been ruled out as being related to T129517?

I'll close it. Nothing else to do here after the patch.

Joe closed subtask T126632: Scap should restart job runners to pick up new config as Invalid. Apr 28 2016, 10:12 AM

• demon moved this task from Dec2019/1.35.wmf.10+ to Resolved on the Wikimedia-production-error board. May 10 2016, 8:44 PM

• mmodell mentioned this in T135690: DBReplicationWaitError: Could not wait for slaves to catch up to 10.64.0.7. May 18 2016, 9:52 PM

T135690 may be a duplicate of this, at least the symptom and the culprit seem related.

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM
