For the failover, all slaves of db1024 have been moved under db1018. Immediately after that (starting before the failover was complete and continuing afterwards) there has been an increase of MediaWiki "read only" errors on the MediaWiki RPC scalers. It looks as if MediaWiki detects its master database as read only, or detects replication lag, but I cannot see either of those things. Could there be caching in place that makes the jobrunners think that db1024 is still the master? Could there be lag that is not detected by my monitoring? Has MariaDB 10 made the check fail?
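To double-check the database side, something like the following could be run on db1018 (new master) and db1024 (old master). This is only a rough sketch, assuming a local mysql client with working credentials (e.g. from ~/.my.cnf); it is not the actual monitoring check:

# Confirm the server-level read_only flag and the replication thread state / lag
mysql -e "SHOW GLOBAL VARIABLES LIKE 'read_only'"
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

On db1018 one would expect read_only=OFF and no slave threads; on db1024 read_only=ON and a healthy, non-lagging slave of db1018.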
Example trace:
{ "_index": "logstash-2016.02.10", "_type": "mediawiki", "_id": "AVLKuL4xlAIL90ZzRaZm", "_score": null, "_source": { "message": "Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.", "@version": 1, "@timestamp": "2016-02-10T10:27:30.000Z", "type": "mediawiki", "host": "mw1015", "level": "ERROR", "tags": [ "syslog", "es", "es", "exception-json" ], "channel": "exception", "normalized_message": "{\"id\":\"2dbad093\",\"type\":\"DBReadOnlyError\",\"file\":\"/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php\",\"line\":789,\"message\":\"Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.\",", "url": "/rpc/RunJobs.php?wiki=itwiki&type=refreshLinks&maxtime=30&maxmem=300M", "ip": "127.0.0.1", "http_method": "POST", "server": "127.0.0.1", "referrer": null, "uid": "9884b15", "process_id": 1061, "wiki": "itwiki", "mwversion": "1.27.0-wmf.12", "private": true, "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php", "line": 789, "code": 0, "backtrace": [ { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/Database.php", "line": 1505, "function": "query", "class": "DatabaseBase", "type": "->", "args": [ "string", "string" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php", "line": 39, "function": "update", "class": "DatabaseBase", "type": "->", "args": [ "string", "array", "array", "string" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/db/DBConnRef.php", "line": 280, "function": "__call", "class": "DBConnRef", "type": "->", "args": [ "string", "array" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php", "line": 964, "function": "update", "class": "DBConnRef", "type": "->", "args": [ "string", "array", "array", "string" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php", "line": 217, "function": "updateLinksTimestamp", "class": "LinksUpdate", "type": "->", "args": [] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/LinksUpdate.php", "line": 144, "function": "doIncrementalUpdate", "class": "LinksUpdate", "type": "->", "args": [] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/deferred/DataUpdate.php", "line": 99, "function": "doUpdate", "class": "LinksUpdate", "type": "->", "args": [] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php", "line": 253, "function": "runUpdates", "class": "DataUpdate", "type": "::", "args": [ "array" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/jobs/RefreshLinksJob.php", "line": 114, "function": "runForTitle", "class": "RefreshLinksJob", "type": "->", "args": [ "Title" ] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php", "line": 262, "function": "run", "class": "RefreshLinksJob", "type": "->", "args": [] }, { "file": "/srv/mediawiki/php-1.27.0-wmf.12/includes/jobqueue/JobRunner.php", "line": 176, "function": "executeJob", "class": "JobRunner", "type": "->", "args": [ "RefreshLinksJob", "BufferingStatsdDataFactory", "integer" ] }, { "file": "/srv/mediawiki/rpc/RunJobs.php", "line": 47, "function": "run", "class": "JobRunner", "type": "->", "args": [ "array" ] } ], "exception_id": "2dbad093", "class": "mediawiki", "message_checksum": "574ca05b75c07c3e0dc56dfb40ea20ba" }, "sort": [ 1455100050000 ] }
The jobrunner and jobchron services were restarted yesterday after the failover with:
sudo salt -G 'cluster:jobrunner' cmd.run 'service jobrunner status | grep running && service jobrunner restart'
sudo salt -G 'cluster:jobrunner' cmd.run 'service jobchron status | grep running && service jobchron restart'
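To test the "jobrunners still think db1024 is the master" theory, one rough check (again only a sketch, not a confirmed procedure) would be to look for MySQL connections from the jobrunners back to the old master. The FQDN below is an assumption; substitute db1024's actual address:

# Count established connections from jobrunners to the old master on port 3306
OLD_MASTER_IP=$(getent hosts db1024.eqiad.wmnet | awk '{print $1}')   # assumed FQDN
sudo salt -G 'cluster:jobrunner' cmd.run \
  "netstat -tn 2>/dev/null | grep ':3306' | grep -c '${OLD_MASTER_IP}' || true"

If any host reports a non-zero count after the restarts, something is still holding on to the old master address.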