
Database downtime around the time period when Parsoid was deployed
Closed, Invalid (Public)

Description

Apparently, the last deployment of Parsoid created a database outage on the enwiki API database servers:

<logmsgbot> !log arlolra@tin Starting deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<grrrit-wm> (PS2) Giuseppe Lavagetto: api: add TLS termination in eqiad [puppet] - https://gerrit.wikimedia.org/r/327541 
* SPF|Cloud has quit (Quit: Connection closed for inactivity)
* Zppix (uid182351@wikipedia/Zppix) has joined
<grrrit-wm> (CR) Jcrespo: [C: 2] mariadb: Move role::mariadb::client to a separate file [puppet] - https://gerrit.wikimedia.org/r/327556 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) 
<icinga-wm> RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
<icinga-wm> PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CRITICAL slave_io_state could not connect
* OuKB (~max@198.73.209.4) has joined
<jynus> that looks bad
<robh> =[
<icinga-wm> RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes
<logmsgbot> !log arlolra@tin Finished deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240 (duration: 07m 56s)
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<arlolra> !log Updated Parsoid to 6719e240 (T96555)
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<stashbot> T96555: Bug tokenizing commented <ref> - https://phabricator.wikimedia.org/T96555

https://logstash.wikimedia.org/goto/c80ee605942700af7442656e9b051296

Most of the API queries at that time seem to be contributors-related:

Hits 	Tmax 	Tavg 	Tsum 	Hosts 	Users 	Schemas
406	323	81	33,076	db1065, db1066	wikiuser	enwiki
SELECT /* ApiQueryContributors::execute */ rev_page AS `page`, rev_user AS `user`, MAX(rev_user_text) AS `username` FROM `revision` WHERE rev_page = '9228' AND (rev_user != 0) AND ((rev_deleted & 4) = 0) GROUP BY rev_page, rev_user ORDER BY rev_user LIMIT 501 /* 844ceca7dab690eb120535782678485f db1066 enwiki 23s */
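
To make sense of why a single execution of that query can take 23 seconds, its plan can be inspected with EXPLAIN on a replica. This is a minimal diagnostic sketch, assuming direct access to an enwiki replica; it is not something the API or Parsoid itself runs:

    -- Sketch: show which revision index the contributors query uses and how many rows it scans.
    EXPLAIN
    SELECT /* ApiQueryContributors::execute */ rev_page AS `page`, rev_user AS `user`,
           MAX(rev_user_text) AS `username`
    FROM `revision`
    WHERE rev_page = '9228' AND (rev_user != 0) AND ((rev_deleted & 4) = 0)
    GROUP BY rev_page, rev_user
    ORDER BY rev_user
    LIMIT 501;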

The API database servers at the time seem to have been stuck executing >5000 queries at the same time:
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1065&from=1481825322195&to=1481828922196
When MySQL reaches max-connections, new connections are rejected.
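
For reference, connection saturation on a MariaDB host can be confirmed with standard status queries; this is a generic sketch, not output captured during the incident:

    -- Configured connection ceiling.
    SHOW GLOBAL VARIABLES LIKE 'max_connections';
    -- Currently open connections and the high-water mark since startup.
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';
    -- Rejections caused by hitting the ceiling (counter available on recent MariaDB/MySQL).
    SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';
    -- What the connected threads are actually doing right now.
    SHOW FULL PROCESSLIST;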

Bringing down a server is something that should not be taken lightly, and a worse outage could have happened if the two servers had been saturated at the same time rather than only one.

Event Timeline

jcrespo renamed this task from Parsoid deployment create database downtime to Parsoid deployment creates database downtime. Dec 15 2016, 7:10 PM
jcrespo updated the task description.

OCG fetches contributors, but if it were OCG, then (1) why would it be correlated with the Parsoid restart, and (2) those requests should have been coming directly from OCG, not from Parsoid. The OCG code is here:
https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-bundler/blob/master/lib/authors.js
https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-bundler/blob/master/lib/attribution.js

Could this be instead related to @Legoktm's recent BadImages patches? Although the above query seems to be generating an alphabetic list of all contributors to a certain page, and I don't think that's required to evaluate BadImage status....

PageID 9228 is https://en.wikipedia.org/w/index.php?curid=9228 (Earth) fwiw. Were image pages affected, too? If it was OCG, I'd expect to see requests for both page contributors and image contributors.
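
(For anyone double-checking that lookup, the page_id-to-title mapping can also be read straight off a replica; a minimal sketch, unrelated to the incident data:)

    -- Resolve a page ID to its namespace and title.
    SELECT page_namespace, page_title
    FROM page
    WHERE page_id = 9228;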

@Arlolra Not sure it's actually related. T75412: OCG Attribution request times out reguarly causes the Parser/Parsoid to work hard expanding a large template, but (1) it doesn't involve any database queries, and (2) it could probably be solved by breaking up the template instantiation into parts and sticking them back together; it's the single request size that's a problem (causing a timeout), not the amount of work.

OTOH, this bug seems like it's from something looking at all users involved in a specific page, whether that's for OCG's attribution page or to determine if there were bad users involved or something else. In OCG, that happens *before* the T75412 template expansion would be requested -- all the users are sent to OCG, which then creates an appropriate template invocation and sends that to the parser.

If OCG were implicated, the most likely culprit would actually seem to be retries somehow. OCG doesn't talk to Parsoid directly; it talks to RESTBase and the MediaWiki API, although the T75412: OCG Attribution request times out reguarly template expansion gets redirected to Parsoid via RESTBase. When Parsoid was restarted (or temporarily unavailable), RESTBase would have had to return some error code to OCG, which could then cause OCG to repeatedly retry not just the template expansion but the entire attribution query, producing the storm of attribution queries described in this task. I don't see how that would happen... although bugs by definition are surprising.

At first glance at the data in logstash, I think the ApiQueryContributors hits are background noise here. The peak at 18:19:00 reports 132 hits in the 30-second bucket, and 91 when ApiQueryContributors is filtered out.

So far, it seems like this is coincidence and unrelated to the Parsoid deploy, since Parsoid doesn't trigger that contributors query.

Not sure it's actually related.

Yeah, doesn't seem to be, sorry.

From OCG's contributorsQuery,

	var request = {
		action: 'query',
		prop: 'contributors',
		continue: '',
		pclimit: 500,
	};

which corresponds with the URL from the request: /w/api.php?action=query&prop=contributors&continue=&pclimit=500&revids=754871499&format=json

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2016.12.15/mediawiki?id=AVkDvOas3-sTQj-LLt2k&_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2016-12-15T18:13:01.087Z',mode:absolute,to:'2016-12-15T18:27:10.008Z'))

The SQL query seems to match https://github.com/wikimedia/mediawiki/blob/master/includes/api/ApiQueryContributors.php#L108 . OCG uses this at https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-bundler/blob/master/lib/authors.js#L57 but presumably there are other users of query=contributors in the API as well. Can we correlate the database requests with API requests, and maybe get a source IP and/or User-Agent? That would clear/implicate OCG.
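
On the database side, the comment MediaWiki embeds in each query at least lets you pick the module out of the live processlist while a spike is ongoing; a hedged sketch (this yields the client host of the appserver, not the original IP or User-Agent, which would still have to come from the API request logs):

    -- Sketch: find in-flight queries issued by ApiQueryContributors and how long they have been running.
    SELECT ID, USER, HOST, DB, TIME, STATE, LEFT(INFO, 120) AS query_prefix
    FROM information_schema.PROCESSLIST
    WHERE INFO LIKE '%ApiQueryContributors::execute%'
    ORDER BY TIME DESC;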

It looks like @Anomie is saying that the contributors requests weren't actually anomalous, and we should be looking at the other 91 requests in that 30-second bucket at 18:19:00?

While there was an uptick in queries to prop=contributors at 18:19 (363/minute) and 18:20 (322/minute) compared to the minutes just before and after (each around 150/minute), that's less than the rates at other times (e.g. 535 at 18:05, 547 at 18:24).

I looked at the API logs a few different ways and nothing really jumped out at me besides that the number of queries (of all types, not just prop=contributors) taking longer than 1 second started increasing at 18:15, hit a peak at 18:19, and was back to around normal by 18:24.

Arlolra changed the task status from Open to Stalled. Mar 28 2017, 6:27 PM
Arlolra triaged this task as High priority.
ssastry renamed this task from Parsoid deployment creates database downtime to Database downtime around the time period when Parsoid was deployed. Apr 27 2017, 2:57 PM
ssastry lowered the priority of this task from High to Medium. May 19 2017, 9:20 PM

I am inclined to close this as declined. There is no additional information that indicates that this is or was Parsoid-related.

I am OK with that; it hasn't happened again since then, so it could be a very rare coincidence.