
Stalls on db1075 (s3) replica db
Closed, ResolvedPublicSecurity

Description

There have been at least two stalls recently on db1075. The first happened just after the codfw->eqiad datacenter switchover, so we didn't give it too much importance, attributing it to possibly cold caches.

A new one happened around 2020-10-29 11:01, which also coincided with a connectivity issue. However, while those two events (sudden increases in traffic) could be a trigger, the issue may not be directly related to them.

Checking tendril, I see a heavy query hitting at the time of the last event:

Hits | Tmax (s) | Tavg (s) | Tsum (s) | Hosts | Users | Schemas
2425 | 21 | 8 | 21,210 | db1075 | wikiuser | ruwikinews
SELECT /* DynamicPageListHooks::renderDynamicPageList */ page_namespace, page_title, c1.cl_timestamp
FROM `page`
LEFT JOIN `flaggedpages` ON ((page_id = fp_page_id))
INNER JOIN `categorylinks` `c1` ON ((page_id = c1.cl_from) AND (c1.cl_to='????????????'))
INNER JOIN `categorylinks` `c2` ON ((page_id = c2.cl_from) AND (c2.cl_to='?????????'))
LEFT OUTER JOIN `categorylinks` `c3` ON ((page_id = c3.cl_from) AND (c3.cl_to='??_???????????'))
LEFT OUTER JOIN `categorylinks` `c4` ON ((page_id = c4.cl_from) AND (c4.cl_to='?????????_???????_??_?????'))
LEFT OUTER JOIN `categorylinks` `c5` ON ((page_id = c5.cl_from) AND (c5.cl_to='????????_???????'))
LEFT OUTER JOIN `categorylinks` `c6` ON ((page_id = c6.cl_from) AND (c6.cl_to='???????????_???????'))
WHERE page_namespace = 0 AND (fp_stable IS NOT NULL) AND page_is_redirect = 0
  AND c3.cl_to IS NULL AND c4.cl_to IS NULL AND c5.cl_to IS NULL AND c6.cl_to IS NULL
ORDER BY page_id DESC LIMIT 18
/* a66f1dbef76187d62eb21f46256602f1 db1075 ruwikinews 2s */

Screenshot_20201029_123535.png

Only ruwikinews activity was detected as slow at the time of the last incident, which could be related to an inefficient code path.
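For diagnosis, the plan for a query of this shape could be checked directly on the replica. This is only a simplified sketch: the category names are placeholders (tendril redacts the literals), real ruwikinews categories would have to be substituted, and the exclusion joins are omitted for brevity.

EXPLAIN
SELECT page_namespace, page_title, c1.cl_timestamp
FROM page
LEFT JOIN flaggedpages ON page_id = fp_page_id
INNER JOIN categorylinks c1 ON page_id = c1.cl_from AND c1.cl_to = 'Category_A'  -- placeholder literal
INNER JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'Category_B'  -- placeholder literal
WHERE page_namespace = 0
  AND fp_stable IS NOT NULL
  AND page_is_redirect = 0
ORDER BY page_id DESC
LIMIT 18;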

Looking at db1075, high InnoDB activity was seen at the time of the issue:

Screenshot_20201029_124212.png

with spikes in rows read but no large change in user requests.
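The rows-read spikes could also be confirmed from the server itself with standard MariaDB status counters; a minimal sketch (sample twice and diff to get a rate):

-- Rows read by InnoDB since server start; compare two samples for a rate.
SHOW GLOBAL STATUS LIKE 'Innodb_rows_read';
-- Sequential row fetches, typically from full scans or wide index ranges.
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';
-- Total statements executed, to contrast against the flat user-request rate.
SHOW GLOBAL STATUS LIKE 'Questions';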

My working theory is that an infrequent but badly optimized query is overloading the server. Making this task private as it currently contains security-sensitive information, as well as an input path for DoSing the databases.

Details

Author Affiliation
WMF Technology Dept

Event Timeline

Adding MediaWiki-General because we need to identify which team may know about DynamicPageListHooks::renderDynamicPageList

These are DPL queries again hitting ruwikinews - possibly related to T262240 and T262391 ?

Marostegui moved this task from Triage to In progress on the DBA board.


Sounds so... We can disable it (again), but that would probably make the ruwikinews community very mad (again).

Note: I backported https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626268 a while ago. That created a short-lived cache (2 minutes) for the expensive DPL queries, so MediaWiki won't recompute the same results over and over. Hopefully it will help.

Looking great so far-

The large amount of rows read has flattened:

Screenshot_20201029_163713.png

Slow queries on db1075 reduced to 0:

Screenshot_20201029_163754.png

I am mostly certain that this was the issue causing the db1075 stalls, as the processlist has decreased a lot (a slow query can be millions of times more impactful than a regular query).

Screenshot_20201029_164432.png

From a certain point of view, @Urbanecm you made db1075 10x faster, thank you!
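For reference, the processlist improvement could be spot-checked directly on db1075 with something like the following sketch using information_schema (the 2-second threshold here is only illustrative, not a site setting):

-- Count statements that have been executing for more than a couple of seconds.
SELECT COUNT(*) AS long_running
FROM information_schema.processlist
WHERE command = 'Query'
  AND time > 2;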

CDanis changed the visibility from "Custom Policy" to "Public (No Login Required)".Oct 29 2020, 3:45 PM
CDanis changed the edit policy from "Custom Policy" to "All Users".
CDanis removed a project: Security.
jcrespo claimed this task.

I am going to consider this resolved, unless we were completely wrong and this wasn't the cause of the database stalls/connectivity issues.

Not related, but the first incident was spotted during the datacenter switchover.


Out of curiosity, did this have an effect on db performance relative to normal times when ruwikinews isn't overloading things?


It did. Every time a database goes over 64 simultaneous connections, the queries are queued up. While this prevents a big impact on query running latency, it does impact the overall response latency (due to longer wait times before execution).

Sadly, while we have the metrics, we don't have historical data for them to give you an exact answer on impact, in particular without also defining what "normal" or "latency" means. The 10x was one of the possible metrics (how long metrics scraping took during the bad period versus now), but it is not a full picture of the impact, which is why I stressed "from a certain point of view":

https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&var-server=db1075&var-port=9104&from=1603970444083&to=1603986844156
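As a rough way to observe the queueing effect described above on the server itself, the running-thread count can be compared against the concurrency ceiling. This is only a sketch: the task does not name the exact setting behind the 64-connection limit, so innodb_thread_concurrency is an assumption here.

-- Threads currently executing statements.
SHOW GLOBAL STATUS LIKE 'Threads_running';
-- Possible concurrency ceiling (assumption; the exact setting is not named in this task).
SHOW GLOBAL VARIABLES LIKE 'innodb_thread_concurrency';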