
Wikidata and dewiki databases locked
Closed, Resolved · Public

Description

Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.

The system administrator who locked it offered this explanation: The database master is running in read-only mode.

Please unlock!


Unlocked. Maybe a post-mortem is needed.

Event Timeline

Esc3300 created this task. Jul 28 2017, 6:16 AM
Restricted Application added subscribers: PokestarFan, Aklapper. Jul 28 2017, 6:16 AM
Esc3300 triaged this task as Unbreak Now! priority. Jul 28 2017, 6:16 AM
Esc3300 added a subscriber: Lydia_Pintscher.
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Jul 28 2017, 6:16 AM

Last edit: 05:48
Current time: 06:17

dewiki has been read-only since 05:48 as well.

Wikivoyage seems to work.

Dutch Wikipedia also works. So it's not all projects.

greg added a subscriber: greg. Jul 28 2017, 7:02 AM

The database crashed; it should be OK to edit now.

Joe added a subscriber: Joe. Jul 28 2017, 7:05 AM

I just did two test edits and can confirm it works.

Esc3300 added a comment. Edited Jul 28 2017, 7:12 AM

Yes, it's back! Thanks for your help.

(diff | hist) . . N Heieren vestre (Q33590573)‎; 07:03 . . (+159)‎ . . ‎Haros (talk | contribs)‎ (‎Created a new item: #quickstatements) (Tag: Widar [1.4])
(diff | hist) . . Stenbadan (Q21977112)‎; 07:03 . . (-115)‎ . . ‎Kitayama (talk | contribs)‎ (‎Page on [svwiki] deleted: Stenbådan (syd Brändö, Åland)) [rollback]
(diff | hist) . . 99minutos.com (Q33542455)‎; 07:03 . . (-95)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: 99minutos.com) [rollback]
(diff | hist) . . Sughada (Q19899419)‎; 07:03 . . (-88)‎ . . ‎SoWhy (talk | contribs)‎ (‎Page on [enwiki] deleted: Sughada) [rollback]
(diff | hist) . . russians in germany (Q33526978)‎; 07:03 . . (+1)‎ . . ‎ALDO CP (talk | contribs)‎ (‎Page moved from [frwiki:Russe d'Allemagne] to [frwiki:Russes d'Allemagne]) [rollback]
(diff | hist) . . RADIO YI (Q33549953)‎; 07:03 . . (-89)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: RADIO YI) [rollback]
(diff | hist) . . Misiones Nazarenas Internacionales (Q16608007)‎; 07:03 . . (-116)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: Misiones Nazarenas Internacionales) [rollback]


(diff | hist) . . Template:TOC right (Q5626794)‎; 05:48 . . (+1)‎ . . ‎Cycn (talk | contribs)‎ (‎Changed English label: Template:TOC right)
(diff | hist) . . Nathathupatti Gram Panchayat (Q23744541)‎; 05:48 . . (+87)‎ . . ‎Info-farmer (talk | contribs)‎ (‎Updated item: + en.label (Nathathupatti Gram Panchayat) [rollback]
(diff | hist) . . Undersökning af hittills bekanta tantalhaltiga fossiliers sammansåttning (Q33589754)‎; 05:48 . . (+428)‎ . . ‎Chris.urs-o (talk | contribs)‎ (‎Created claim: main subject (P921): mineralogy (Q83353))
(diff | hist) . . Ottawa Porchfest (Q33590321)‎; 05:47 . . (+365)‎ . . ‎Devon Fyson (talk | contribs)‎ (‎Created claim: official website (P856): http://ottawaporchfest.ca)
(diff | hist) . . Jun Maeda (Q1713232)‎; 05:47 . . (+425)‎ . . ‎Innotata (talk | contribs)‎ (‎Created claim: place of death (P20): Manchester (Q18125)) [rollback]
Esc3300 lowered the priority of this task from Unbreak Now! to Needs Triage. Jul 28 2017, 7:12 AM
Esc3300 updated the task description.
jcrespo claimed this task. Jul 28 2017, 9:03 AM

The investigation is not over; here is what we have found out so far about the causes:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only

jcrespo renamed this task from Wikidata database locked to Wikidata and dewiki databases locked. Jul 28 2017, 9:11 AM
Elitre added a subscriber: Elitre. Jul 28 2017, 12:29 PM

I've almost finished the above incident documentation. However, I am unsure about which are the right actionables and their priorities (last section).

Let's use this ticket to agree on the best follow-up: a) making Puppet change the read-only state of the db server automatically, b) monitoring read-only status on the master databases, or c) any other option.

From my side, I would prefer option "b" (monitoring read-only status on the active masters).
My reasoning for this is:

I wouldn't like Puppet to automatically change settings, especially on the masters. And if a master crashes, I want to investigate why it crashed (in case the crash can repeat as soon as it gets pooled) and make sure it came back in a healthy state before letting it be a writable master again.

mark added a subscriber: mark. Aug 1 2017, 11:36 AM

I agree; there's a very good reason for setting masters to read-only when something has happened, because it needs manual intervention to investigate whether it's safe to go read-write again. Any automation to do that should be REALLY thoroughly thought through, covering all cases, and it seems doubtful we're able to do that now, especially with naive Puppet manifests.

On the other hand, monitoring and (where appropriate) alerting on masters that are read-only but shouldn't be, OR on MediaWiki hitting read-only databases when it shouldn't (aside from lag and other causes), may be feasible.
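As a minimal sketch of the first kind of check (masters that are read-only but shouldn't be), the logic could look like the following. The `DbHost` type, the `unexpected_read_only` function, and the host names are all hypothetical and only for illustration; it assumes the `read_only` value has already been fetched from each server, for example with `SELECT @@global.read_only`.

```python
from dataclasses import dataclass


@dataclass
class DbHost:
    name: str        # hostname (hypothetical values below)
    is_master: bool  # True if this host is the active master of its section
    read_only: bool  # assumed to come from: SELECT @@global.read_only


def unexpected_read_only(hosts):
    """Return the names of active masters that are currently read-only."""
    return [h.name for h in hosts if h.is_master and h.read_only]


# Hypothetical fleet: one master stuck in read-only, one replica, one healthy master.
hosts = [
    DbHost("master-a.example", is_master=True, read_only=True),
    DbHost("replica-a1.example", is_master=False, read_only=True),
    DbHost("master-b.example", is_master=True, read_only=False),
]

for name in unexpected_read_only(hosts):
    print("ALERT: active master %s is read-only" % name)
```

This only reports; it deliberately does not change the read-only state, in line with the preference above for keeping a human in the loop.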

I have started working on more complete monitoring, which is useful if we go down the route of human monitoring rather than automation. Here is one example:

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16295390s, read_only: False, s1 lag: 0.00s, 34 client(s), 886.72 QPS, connection latency: 0.075968s, query latency: 0.001053s

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=1
CRIT: read_only: "False", expected "True"; OK: Version 10.0.28-MariaDB, Uptime 16295413s, s1 lag: 0.00s, 40 client(s), 615.42 QPS, connection latency: 0.069848s, query latency: 0.000987s

It is configurable, so it can be adapted to whatever monitoring we decide is best:

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0 --check_warn_lag=-1
WARN: s1 lag is 0.31s; OK: Version 10.0.28-MariaDB, Uptime 16295662s, read_only: False, 32 client(s), 761.35 QPS, connection latency: 0.066061s, query latency: 0.001263s
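The way such a status line and exit code could be derived from the individual checks can be sketched roughly as follows. This is not the code of check_mariadb.py; the function name `icinga_status` and its parameters are assumptions for illustration. It only imitates the message layout shown in the outputs above and uses the conventional Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios/Icinga plugin exit codes


def icinga_status(read_only, expected_read_only, lag, warn_lag, crit_lag):
    """Return (exit_code, message) for one host (hypothetical helper)."""
    crit, warn, ok = [], [], []

    # Read-only mismatch is treated as critical, as in the output above.
    if read_only != expected_read_only:
        crit.append('read_only: "%s", expected "%s"' % (read_only, expected_read_only))
    else:
        ok.append("read_only: %s" % read_only)

    # Lag is compared against the warning and critical thresholds.
    if lag >= crit_lag:
        crit.append("lag is %.2fs" % lag)
    elif lag >= warn_lag:
        warn.append("lag is %.2fs" % lag)
    else:
        ok.append("lag: %.2fs" % lag)

    # With no problems, print the plain summary without any prefix.
    if not crit and not warn:
        return OK, ", ".join(ok)

    parts = []
    if crit:
        parts.append("CRIT: " + ", ".join(crit))
    if warn:
        parts.append("WARN: " + ", ".join(warn))
    if ok:
        parts.append("OK: " + ", ".join(ok))
    return (CRITICAL if crit else WARNING), "; ".join(parts)


code, message = icinga_status(False, True, lag=0.0, warn_lag=5.0, crit_lag=10.0)
print(code, message)
```

The worst individual result decides the exit code, so a single unexpected read-only state is enough to page even when every other metric is healthy.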

Code coming soon.

Change 369397 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server

https://gerrit.wikimedia.org/r/369397

Maybe as an action point for (unlikely) future incidents, when Wikidata goes into read-only the subscriptions mentioned at T171950#3482017 should be queued by the system.

jcrespo added a comment. Edited Aug 1 2017, 5:39 PM

"Wikidata goes into read-only the subscriptions mentioned"

Yes, some extensions in the past definitely did not behave perfectly and did not respect MediaWiki's read-only mode. I do not know what the state of Wikidata is, but from what you say, a ticket should be filed so that its state is investigated and handled, to better support read-only mode.

Change 369397 merged by Jcrespo:
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server

https://gerrit.wikimedia.org/r/369397

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad
{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, "connection": "ok", "connection_latency": 0.07626748085021973, "ssl": true, "total_queries": 15981662418, "heartbeat": {"s1": 0.400536}, "uptime": 16474250, "version": "10.0.28-MariaDB", "query_latency": 0.001131296157836914, "read_only": false, "threads_connected": 49}

So complete coverage is now available for all hosts, including connection checking (with a 1-second timeout), SSL (including expiration time), slave status and heartbeat lag, QPS, read-only mode, concurrency, uptime/recent restart, version, query latency, and connection latency.

$ check_mariadb.py -h labsdb1009 --slave-status --primary-dc=eqiad
{"total_queries": 431519763, "read_only": false, "query_latency": 0.0015988349914550781, "threads_connected": 3, "version": "10.1.25-MariaDB", "ssl": true, "replication": {"s6": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "db1095": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s7": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s2": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}}, "datetime": 1501777578.464523, "connection": "ok", "ssl_expiration": 1626257725.0, "connection_latency": 0.05619382858276367, "uptime": 537221, "heartbeat": {"s3": 0.0, "s6": 0.0, "s7": 0.0, "s5": 0.0, "s1": 0.0, "s4": 0.0, "s2": 0.0}}
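Since the non-Icinga output is JSON, other tools can consume it directly. Below is a rough sketch that flags unhealthy replication channels. The `replication_problems` function and the `max_lag` threshold are hypothetical; the sample is trimmed from the labsdb1009 output above, and the "broken" variant is invented purely to exercise the failure path.

```python
import json

# Trimmed from the labsdb1009 output above (only two channels kept).
sample = """
{"read_only": false,
 "replication": {
   "s6": {"Last_IO_Error": null, "Seconds_Behind_Master": 0,
          "Slave_SQL_Running": "Yes", "Last_SQL_Error": null,
          "Slave_IO_Running": "Yes"},
   "s7": {"Last_IO_Error": null, "Seconds_Behind_Master": 0,
          "Slave_SQL_Running": "Yes", "Last_SQL_Error": null,
          "Slave_IO_Running": "Yes"}}}
"""


def replication_problems(status, max_lag=1):
    """Return sorted channel names whose threads are stopped or lagging."""
    problems = []
    for channel, info in sorted(status.get("replication", {}).items()):
        running = (info.get("Slave_IO_Running") == "Yes"
                   and info.get("Slave_SQL_Running") == "Yes")
        lag = info.get("Seconds_Behind_Master")
        # A null lag means replication is not running on that channel.
        if not running or lag is None or lag > max_lag:
            problems.append(channel)
    return problems


status = json.loads(sample)
print(replication_problems(status))

# Hypothetical failure: pretend the SQL thread on s7 has stopped.
broken = json.loads(sample)
broken["replication"]["s7"]["Slave_SQL_Running"] = "No"
broken["replication"]["s7"]["Seconds_Behind_Master"] = None
print(replication_problems(broken))
```

Hosts without a "replication" key, such as the db1052 output above, simply yield an empty list.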

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad --icinga --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16474706s, read_only: False, s1 lag: 0.00s, 39 client(s), 481.09 QPS, connection latency: 0.071849s, query latency: 0.001223s

The pending step is to work out how to use these tools to minimize downtime in the future.

jcrespo closed this task as Resolved. Aug 4 2017, 3:21 PM

I have created all actionables in both the incident documentation ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only ) and Phabricator. Consequently, I have closed this ticket and will start to work on the follow-ups to prevent or minimize effects in the future (databases have been OK for over a week).