Maniphest T171928

Wikidata and dewiki databases locked
Closed, Resolved · Public

Assigned To

Authored By

	Esc3300
	Jul 28 2017, 6:16 AM

Description

Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.

The system administrator who locked it offered this explanation: The database master is running in read-only mode.

Please unlock!

Unlocked. Maybe a post-mortem is needed.

Details

	Subject	Repo	Branch	Lines +/-
	mariadb: Add new python3 script to check the health of a server	operations/puppet	production	+585 -0

Related Objects

Status	Assigned	Task
Resolved	jcrespo	T171928 Wikidata and dewiki databases locked
Resolved	jcrespo	T172489 Monitor read_only on all databases, make it page on masters
Resolved	jcrespo	T172490 Monitor swap/memory usage on databases

Event Timeline

Esc3300 created this task. Jul 28 2017, 6:16 AM

Restricted Application added subscribers: PokestarFan, Aklapper. Jul 28 2017, 6:16 AM

Esc3300 triaged this task as Unbreak Now! priority. Jul 28 2017, 6:16 AM

Esc3300 added a subscriber: Lydia_Pintscher.

Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Jul 28 2017, 6:16 AM

Last edit: 05:48
Current time: 06:17

dewiki has been read-only since 05:48 as well.

Esc3300 added a project: SRE. Jul 28 2017, 6:35 AM

Smalyshev subscribed. Jul 28 2017, 6:36 AM

Wikivoyage seems to work.

Dutch Wikipedia also works. So it's not all projects.

greg subscribed. Jul 28 2017, 7:02 AM

The database crashed; it should be OK to edit now.

I just did two test edits, I can confirm it works.

Yes, it's back! Thanks for your help.

(diff | hist) . . N Heieren vestre (Q33590573)‎; 07:03 . . (+159)‎ . . ‎Haros (talk | contribs)‎ (‎Created a new item: #quickstatements) (Tag: Widar [1.4])
(diff | hist) . . Stenbadan (Q21977112)‎; 07:03 . . (-115)‎ . . ‎Kitayama (talk | contribs)‎ (‎Page on [svwiki] deleted: Stenbådan (syd Brändö, Åland)) [rollback]
(diff | hist) . . 99minutos.com (Q33542455)‎; 07:03 . . (-95)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: 99minutos.com) [rollback]
(diff | hist) . . Sughada (Q19899419)‎; 07:03 . . (-88)‎ . . ‎SoWhy (talk | contribs)‎ (‎Page on [enwiki] deleted: Sughada) [rollback]
(diff | hist) . . russians in germany (Q33526978)‎; 07:03 . . (+1)‎ . . ‎ALDO CP (talk | contribs)‎ (‎Page moved from [frwiki:Russe d'Allemagne] to [frwiki:Russes d'Allemagne]) [rollback]
(diff | hist) . . RADIO YI (Q33549953)‎; 07:03 . . (-89)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: RADIO YI) [rollback]
(diff | hist) . . Misiones Nazarenas Internacionales (Q16608007)‎; 07:03 . . (-116)‎ . . ‎Tarawa1943 (talk | contribs)‎ (‎Page on [eswiki] deleted: Misiones Nazarenas Internacionales) [rollback]


(diff | hist) . . Template:TOC right (Q5626794)‎; 05:48 . . (+1)‎ . . ‎Cycn (talk | contribs)‎ (‎Changed English label: Template:TOC right)
(diff | hist) . . Nathathupatti Gram Panchayat (Q23744541)‎; 05:48 . . (+87)‎ . . ‎Info-farmer (talk | contribs)‎ (‎Updated item: + en.label (Nathathupatti Gram Panchayat) [rollback]
(diff | hist) . . Undersökning af hittills bekanta tantalhaltiga fossiliers sammansåttning (Q33589754)‎; 05:48 . . (+428)‎ . . ‎Chris.urs-o (talk | contribs)‎ (‎Created claim: main subject (P921): mineralogy (Q83353))
(diff | hist) . . Ottawa Porchfest (Q33590321)‎; 05:47 . . (+365)‎ . . ‎Devon Fyson (talk | contribs)‎ (‎Created claim: official website (P856): http://ottawaporchfest.ca)
(diff | hist) . . Jun Maeda (Q1713232)‎; 05:47 . . (+425)‎ . . ‎Innotata (talk | contribs)‎ (‎Created claim: place of death (P20): Manchester (Q18125)) [rollback]

Esc3300 lowered the priority of this task from Unbreak Now! to Needs Triage. Jul 28 2017, 7:12 AM

Esc3300 updated the task description.

The investigation is not over; here is what we have found out so far about the causes:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only

jcrespo renamed this task from Wikidata database locked to Wikidata and dewiki databases locked. Jul 28 2017, 9:11 AM

Peachey88 added a project: Wikimedia-Incident. Jul 28 2017, 9:14 AM

Elitre subscribed. Jul 28 2017, 12:29 PM

Bugreporter mentioned this in T171950: Run bot to clean up issues caused by T171928. Jul 28 2017, 12:57 PM

Marostegui subscribed. Jul 31 2017, 11:01 AM

matej_suchanek added a project: User-notice. Jul 31 2017, 2:41 PM

matej_suchanek moved this task from To Triage to In current Tech/News draft on the User-notice board.

I've almost finished the above incident documentation. However, I am unsure about which are the right actionables and their priorities (last section).

Let's use this ticket to agree on the best follow-up: a) making Puppet change the read-only state of the DB server automatically, b) monitoring read-only status on the master databases, or c) any other option.

From my side, I would prefer option "b" (monitoring read-only status on the active masters).
My reasoning for this is:

I wouldn't like Puppet to automatically change settings, especially on the masters. And if a master crashes, I want to investigate why it crashed (in case it crashes again as soon as it gets pooled) and make sure it came back in a healthy state before letting it be a writable master again.

I agree; there's a very good reason for setting masters to read-only when something has happened, because manual intervention is needed to investigate whether it's safe to go read-write again. Any automation to do that should be REALLY thoroughly thought through, covering all cases, and it seems doubtful we're able to do that now, especially with naive Puppet manifests.

On the other hand, monitoring and (where appropriate) alerting on masters that are read-only but shouldn't be, OR on MediaWiki hitting read-only databases when it shouldn't (aside from lag and other causes), may be feasible.
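As a rough illustration of the monitoring option discussed above (this is not code from this task), the alerting rule could be sketched as below. The `Host` class, the `is_active_master` flag, and the severity mapping are all assumptions made for the example; in particular, it assumes that replicas are expected to be read-only and that only a wrong state on an active master deserves the highest severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a database host (not a real inventory format)."""
    name: str
    is_active_master: bool


def read_only_alert(host: Host, read_only: bool) -> str:
    """Return 'OK', 'WARN' or 'CRIT' for the observed read_only value.

    Assumption: an active master should be writable (read_only off) and
    every other host should be read-only.
    """
    expected_read_only = not host.is_active_master
    if read_only == expected_read_only:
        return "OK"
    # A read-only active master blocks all edits, so treat it as critical.
    # A writable replica is suspicious but does not block editing.
    return "CRIT" if host.is_active_master else "WARN"
```

Note that the sketch only reports the mismatch; it deliberately never changes the server state, in line with the preference for manual investigation before a master becomes writable again.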

I have started working on more complete monitoring, which will be useful if we go down the route of human monitoring rather than automation. Here is one example:

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16295390s, read_only: False, s1 lag: 0.00s, 34 client(s), 886.72 QPS, connection latency: 0.075968s, query latency: 0.001053s

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=1
CRIT: read_only: "False", expected "True"; OK: Version 10.0.28-MariaDB, Uptime 16295413s, s1 lag: 0.00s, 40 client(s), 615.42 QPS, connection latency: 0.069848s, query latency: 0.000987s

It is configurable, so it can be adapted to whatever monitoring we decide is best:

$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0 --check_warn_lag=-1
WARN: s1 lag is 0.31s; OK: Version 10.0.28-MariaDB, Uptime 16295662s, read_only: False, 32 client(s), 761.35 QPS, connection latency: 0.066061s, query latency: 0.001263s
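For readers unfamiliar with Icinga-style checks, the status lines above can be approximated with the sketch below. It is not the code of `check_mariadb.py`; the function names and the `warn`/`crit` parameters are hypothetical, and only the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) follow the standard Nagios/Icinga plugin convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARN", CRITICAL: "CRIT"}


def check_read_only(actual: bool, expected: bool):
    """Compare the observed read_only value with the expected one."""
    if actual == expected:
        return OK, f"read_only: {actual}"
    return CRITICAL, f'read_only: "{actual}", expected "{expected}"'


def check_lag(section: str, lag: float, warn: float, crit: float = float("inf")):
    """Classify replication lag (in seconds) against two thresholds."""
    if lag >= crit:
        return CRITICAL, f"{section} lag is {lag:.2f}s"
    if lag >= warn:
        return WARNING, f"{section} lag is {lag:.2f}s"
    return OK, f"{section} lag: {lag:.2f}s"


def summarize(results):
    """Merge (code, message) pairs into one exit code and one status line.

    The worst code wins; messages are grouped by severity, worst first.
    When everything is OK, the line carries no severity prefix.
    """
    worst = max(code for code, _ in results)
    if worst == OK:
        return OK, ", ".join(msg for _, msg in results)
    parts = []
    for code in (CRITICAL, WARNING, OK):
        msgs = [msg for c, msg in results if c == code]
        if msgs:
            parts.append(f"{LABELS[code]}: " + ", ".join(msgs))
    return worst, "; ".join(parts)
```

A real plugin would print the status line and pass the code to `sys.exit()`; the sketch returns both so that the logic can be tested without a database.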

Code coming soon.

Change 369397 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server

https://gerrit.wikimedia.org/r/369397

gerritbot added a project: Patch-For-Review. Aug 1 2017, 3:04 PM

Maybe as an action point for (unlikely) future incidents, when Wikidata goes into read-only the subscriptions mentioned at T171950#3482017 should be queued by the system.

Wikidata goes into read-only the subscriptions mentioned

Yes, some extensions have definitely not behaved perfectly in the past and do not respect MediaWiki's read-only mode. I do not know what the state of Wikidata is, but from what you say, a ticket should be filed so that its state is investigated and handled, to better support read-only mode.
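To make the queueing idea concrete, here is a purely hypothetical sketch of deferring writes while the database is read-only. It is not MediaWiki or Wikibase code; `DeferredWriter`, `is_read_only`, and `apply_write` are names invented for the example, and a real implementation would need a persistent queue rather than an in-memory one.

```python
from collections import deque


class DeferredWriter:
    """Queue writes while storage is read-only and replay them in order later."""

    def __init__(self, is_read_only, apply_write):
        self._is_read_only = is_read_only  # callable returning True/False
        self._apply_write = apply_write    # callable performing one write
        self._pending = deque()

    def submit(self, change) -> bool:
        """Queue the change, then try to drain the queue.

        Returns True if nothing is left pending (the change was written).
        """
        self._pending.append(change)
        self.flush()
        return not self._pending

    def flush(self) -> int:
        """Apply queued changes, oldest first, until read-only is hit again."""
        applied = 0
        while self._pending and not self._is_read_only():
            self._apply_write(self._pending.popleft())
            applied += 1
        return applied
```

Routing every change through the queue keeps the original order even when read-only mode ends between two submissions.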

Johan moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board. Aug 3 2017, 2:45 AM

Change 369397 merged by Jcrespo:
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server

https://gerrit.wikimedia.org/r/369397

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad
{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, "connection": "ok", "connection_latency": 0.07626748085021973, "ssl": true, "total_queries": 15981662418, "heartbeat": {"s1": 0.400536}, "uptime": 16474250, "version": "10.0.28-MariaDB", "query_latency": 0.001131296157836914, "read_only": false, "threads_connected": 49}

So complete coverage is now available for all hosts, including connection checking (with a 1-second timeout), SSL (including expiration time), slave status and heartbeat lag, QPS, read-only mode, concurrency, uptime/recent restart, version, query latency, and connection latency.
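The JSON output lends itself to further processing. The sketch below shows one way a consumer might turn it into a list of problems; the sample string is a subset of the db1052 output above, while the function name and the thresholds (`min_uptime_s`, `min_ssl_days`) are assumptions chosen for illustration and are not part of the script.

```python
import json

# Subset of the db1052 output shown above.
SAMPLE = (
    '{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, '
    '"connection": "ok", "uptime": 16474250, "read_only": false}'
)


def ssl_days_left(status: dict) -> float:
    """Days between the check time and the SSL certificate expiration."""
    return (status["ssl_expiration"] - status["datetime"]) / 86400


def find_problems(raw: str, expect_read_only: bool,
                  min_uptime_s: int = 3600, min_ssl_days: float = 30.0) -> list:
    """Return human-readable problems found in one JSON status document."""
    status = json.loads(raw)
    problems = []
    if status.get("connection") != "ok":
        problems.append("connection failed")
        return problems  # nothing else is reliable without a connection
    if status.get("read_only") != expect_read_only:
        problems.append(
            f"read_only is {status.get('read_only')}, expected {expect_read_only}"
        )
    if status.get("uptime", 0) < min_uptime_s:
        problems.append("recent restart")
    if "ssl_expiration" in status and "datetime" in status:
        if ssl_days_left(status) < min_ssl_days:
            problems.append("SSL certificate expires soon")
    return problems
```

For example, `find_problems(SAMPLE, expect_read_only=False)` returns an empty list, whereas passing `expect_read_only=True` reports the read_only mismatch.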

$ check_mariadb.py -h labsdb1009 --slave-status --primary-dc=eqiad
{"total_queries": 431519763, "read_only": false, "query_latency": 0.0015988349914550781, "threads_connected": 3, "version": "10.1.25-MariaDB", "ssl": true, "replication": {"s6": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "db1095": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s7": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s2": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}}, "datetime": 1501777578.464523, "connection": "ok", "ssl_expiration": 1626257725.0, "connection_latency": 0.05619382858276367, "uptime": 537221, "heartbeat": {"s3": 0.0, "s6": 0.0, "s7": 0.0, "s5": 0.0, "s1": 0.0, "s4": 0.0, "s2": 0.0}}

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad --icinga --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16474706s, read_only: False, s1 lag: 0.00s, 39 client(s), 481.09 QPS, connection latency: 0.071849s, query latency: 0.001223s
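The `replication` object in the labsdb1009 output can be inspected in the same way. The following sketch lists the replication channels that look unhealthy; it relies only on the keys visible in that output, and the function name and the wording of the reasons are hypothetical.

```python
def broken_channels(status: dict) -> dict:
    """Map each unhealthy replication channel to the reasons it looks broken.

    `status` is the parsed JSON document; hosts without a "replication"
    key (such as the db1052 output above) simply yield an empty result.
    """
    broken = {}
    for channel, info in status.get("replication", {}).items():
        reasons = []
        if info.get("Slave_IO_Running") != "Yes":
            reasons.append("IO thread not running")
        if info.get("Slave_SQL_Running") != "Yes":
            reasons.append("SQL thread not running")
        for key in ("Last_IO_Error", "Last_SQL_Error"):
            if info.get(key):
                reasons.append(f"{key}: {info[key]}")
        if reasons:
            broken[channel] = reasons
    return broken


# One healthy channel, copied from the labsdb1009 output above.
healthy = {
    "replication": {
        "s6": {
            "Last_IO_Error": None,
            "Seconds_Behind_Master": 0,
            "Slave_SQL_Running": "Yes",
            "Last_SQL_Error": None,
            "Slave_IO_Running": "Yes",
        }
    }
}
print(broken_channels(healthy))
```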

The pending step is deciding how to use the tools to minimize downtime in the future.

jcrespo added a subtask: T172489: Monitor read_only on all databases, make it page on masters. Aug 4 2017, 8:32 AM

jcrespo added a subtask: T172490: Monitor swap/memory usage on databases. Aug 4 2017, 8:36 AM

I have created all actionables on both the incident documentation ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only ) and Phabricator. Consequently, I have closed this ticket and will start to work on the follow-ups to prevent or minimize the effects in the future (the databases have been OK for over a week).

jcrespo mentioned this in T172489: Monitor read_only on all databases, make it page on masters. Aug 4 2017, 3:23 PM

Liuxinyu970226 unsubscribed. Aug 4 2017, 11:14 PM

Esc3300 mentioned this in T173008: Create a maintenance script for pruning stale entity subscriptions and run periodically. Aug 10 2017, 4:27 PM

jcrespo changed the status of subtask T172489: Monitor read_only on all databases, make it page on masters from Open to Stalled. Sep 19 2018, 1:07 PM

jcrespo changed the status of subtask T172489: Monitor read_only on all databases, make it page on masters from Stalled to Open. May 7 2020, 7:03 AM

jcrespo closed subtask T172489: Monitor read_only on all databases, make it page on masters as Resolved. May 7 2020, 2:04 PM

TerraCodes unsubscribed. May 10 2020, 12:18 AM

jcrespo closed subtask T172490: Monitor swap/memory usage on databases as Resolved. Aug 10 2020, 4:56 PM

Ladsgroup edited projects, added User-notice-archive; removed User-notice. Aug 13 2022, 1:53 PM