|Resolved||PRODUCTION ERROR||Pginer-WMF||T195293 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly)|
|Resolved||Marostegui||T201011 Apply schema change to translate_reviews in WMF|
- Mentioned In
- T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear
T203925: Save times for changes to translation variable text in centralnotice paralysingly slow
T195523: failure of open wikipedia multiple times
T193562: yiwikisource logo needs updating
T194389: Enable $wgUseRCPatrol on azwiki
- Mentioned Here
- T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear
T201011: Apply schema change to translate_reviews in WMF
T195520: Multiple projects reporting Cannot access the database: No working replica DB server
Same in Russian Wikipedia (from Russia):
Request from […] via cp1054 cp1054, Varnish XID 382637319 Error: 503, Backend fetch failed at Tue, 22 May 2018 13:43:18 GMT
I guess this affects all Wikimedia projects right now.
The issue seems to point to
Hits Tmax Tavg Tsum Hosts Users Schemas 33533 15 5 173,069 db1079, db1086, db1094 wikiuser metawiki
That is the equivalent to 10-100 database hours of work in 5 minutes.
The query is very slow, normally from memory, it takes 0.5 seconds, but that is quite bad, and I can see it running with no index and scanning 200K rows.
EXPLAIN SELECT /* MessageCollection::loadReviewInfo <IP> */ page_namespace, page_title, trr_user FROM `page` JOIN `translate_reviews` ON ((page_id=trr_page) AND (page_latest=trr_revision)) WHERE (page_namespace = '1198' AND page_title IN ('Privacy_policy/Page_display_title/en', 'Privacy_policy/275/en', 'Privacy_policy/244/en', 'Privacy_policy/245/en', 'Privacy_policy/22/en', 'Privacy_policy/23/en', 'Privacy_policy/24/en', 'Privacy_policy/25/en', 'Privacy_policy/26/en', 'Privacy_policy/246/en', 'Privacy_policy/247/en', 'Privacy_policy/243/en', 'Privacy_policy/242/en', 'Privacy_policy/87/en', 'Privacy_policy/29/en', 'Privacy_policy/186/en', 'Privacy_policy/187/en', 'Privacy_policy/188/en', 'Privacy_policy/189/en', 'Privacy_policy/190/en', 'Privacy_policy/30/en', 'Privacy_policy/31/en', 'Privacy_policy/32/en', 'Privacy_policy/33/en', 'Privacy_policy/34/en', 'Privacy_policy/35/en', 'Privacy_policy/248/en', 'Privacy_policy/36/en', 'Privacy_policy/58/en', 'Privacy_policy/59/en', 'Privacy_policy/60/en', 'Privacy_policy/61/en', 'Privacy_policy/62/en', 'Privacy_policy/63/en', 'Privacy_policy/64/en', 'Privacy_policy/65/en', 'Privacy_policy/66/en', 'Privacy_policy/185/en', 'Privacy_policy/89/en', 'Privacy_policy/90/en', 'Privacy_policy/91/en', 'Privacy_policy/191/en', 'Privacy_policy/93/en', 'Privacy_policy/94/en', 'Privacy_policy/239/en', 'Privacy_policy/249/en', 'Privacy_policy/95/en', 'Privacy_policy/193/en', 'Privacy_policy/194/en', 'Privacy_policy/195/en', 'Privacy_policy/96/en', 'Privacy_policy/97/en', 'Privacy_policy/98/en', 'Privacy_policy/250/en', 'Privacy_policy/99/en', 'Privacy_policy/100/en', 'Privacy_policy/251/en', 'Privacy_policy/125/en', 'Privacy_policy/126/en', 'Privacy_policy/210/en', 'Privacy_policy/127/en', 'Privacy_policy/211/en', 'Privacy_policy/128/en', 'Privacy_policy/129/en', 'Privacy_policy/213/en', 'Privacy_policy/130/en', 'Privacy_policy/131/en', 'Privacy_policy/132/en', 'Privacy_policy/101/en', 'Privacy_policy/197/en', 'Privacy_policy/200/en', 'Privacy_policy/252/en', 'Privacy_policy/102/en', 'Privacy_policy/199/en', 'Privacy_policy/103/en', 'Privacy_policy/104/en', 'Privacy_policy/105/en', 'Privacy_policy/201/en', 'Privacy_policy/106/en', 'Privacy_policy/202/en', 'Privacy_policy/107/en', 'Privacy_policy/108/en', 'Privacy_policy/109/en', 'Privacy_policy/110/en', 'Privacy_policy/111/en', 'Privacy_policy/112/en', 'Privacy_policy/113/en', 'Privacy_policy/114/en', 'Privacy_policy/115/en', 'Privacy_policy/116/en', 'Privacy_policy/117/en', 'Privacy_policy/240/en', 'Privacy_policy/203/en', 'Privacy_policy/118/en', 'Privacy_policy/119/en', 'Privacy_policy/120/en', 'Privacy_policy/253/en', 'Privacy_policy/254/en', 'Privacy_policy/255/en', 'Privacy_policy/256/en', 'Privacy_policy/257/en', 'Privacy_policy/258/en', 'Privacy_policy/259/en', 'Privacy_policy/260/en', 'Privacy_policy/261/en', 'Privacy_policy/262/en', 'Privacy_policy/123/en', 'Privacy_policy/263/en', 'Privacy_policy/264/en', 'Privacy_policy/265/en', 'Privacy_policy/266/en', 'Privacy_policy/267/en', 'Privacy_policy/268/en', 'Privacy_policy/269/en', 'Privacy_policy/270/en', 'Privacy_policy/271/en', 'Privacy_policy/272/en', 'Privacy_policy/133/en', 'Privacy_policy/134/en', 'Privacy_policy/135/en', 'Privacy_policy/216/en', 'Privacy_policy/136/en', 'Privacy_policy/217/en', 'Privacy_policy/137/en', 'Privacy_policy/218/en', 'Privacy_policy/138/en', 'Privacy_policy/139/en', 'Privacy_policy/140/en', 'Privacy_policy/219/en', 'Privacy_policy/141/en', 'Privacy_policy/220/en', 'Privacy_policy/142/en', 'Privacy_policy/221/en', 'Privacy_policy/143/en', 'Privacy_policy/236/en', 'Privacy_policy/144/en', 'Privacy_policy/145/en', 'Privacy_policy/146/en', 'Privacy_policy/222/en', 'Privacy_policy/147/en', 'Privacy_policy/223/en', 'Privacy_policy/148/en', 'Privacy_policy/273/en', 'Privacy_policy/178/en', 'Privacy_policy/149/en', 'Privacy_policy/224/en', 'Privacy_policy/150/en', 'Privacy_policy/151/en', 'Privacy_policy/152/en', 'Privacy_policy/227/en', 'Privacy_policy/153/en', 'Privacy_policy/228/en', 'Privacy_policy/154/en', 'Privacy_policy/155/en', 'Privacy_policy/156/en', 'Privacy_policy/157/en', 'Privacy_policy/237/en', 'Privacy_policy/158/en', 'Privacy_policy/274/en', 'Privacy_policy/229/en', 'Privacy_policy/159/en', 'Privacy_policy/230/en', 'Privacy_policy/160/en', 'Privacy_policy/161/en', 'Privacy_policy/162/en', 'Privacy_policy/163/en', 'Privacy_policy/166/en', 'Privacy_policy/164/en', 'Privacy_policy/165/en', 'Privacy_policy/231/en', 'Privacy_policy/180/en', 'Privacy_policy/232/en', 'Privacy_policy/233/en', 'Privacy_policy/181/en', 'Privacy_policy/182/en', 'Privacy_policy/183/en', 'Privacy_policy/234/en', 'Privacy_policy/167/en', 'Privacy_policy/235/en', 'Privacy_policy/168/en', 'Privacy_policy/169/en', 'Privacy_policy/170/en', 'Privacy_policy/171/en', 'Privacy_policy/172/en', 'Privacy_policy/173/en', 'Privacy_policy/174/en', 'Privacy_policy/175/en', 'Privacy_policy/176/en', 'Privacy_policy/177/en') )\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: translate_reviews type: index possible_keys: NULL key: PRIMARY key_len: 12 ref: NULL rows: 194094 Extra: Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: page type: eq_ref possible_keys: PRIMARY,name_title key: PRIMARY key_len: 4 ref: metawiki.translate_reviews.trr_page rows: 1 Extra: Using where 2 rows in set (0.03 sec)
With the comment above mentioning metawiki and Privacy_policy I can't help but speculate that the edit at 13:28 or translation log entries at 13:26 and 13:32 might have had something to do with it...
13:28, 22 May 2018 Kaganer (talk | contribs) m . . (52,758 bytes) (+3) . . (thank) - https://meta.wikimedia.org/w/index.php?title=Privacy_policy&diff=18066512&oldid=18063747
It is not clear to me what could have caused such a high number of simultaneous queries for MessageCollection::loadReviewInfo, but it is easy to understand that if they start piling up it goes very fast downhill from there.
MessageGroupStats::clearGroup( $id ); MessageGroupStats::forGroup( $id );
I am not 100% sure how transactions work in the job queue, but if these two are not executed inside the same transaction, there could be a brief moment where stats are missing and many simultaneous page views to Privacy_policy could trigger a bunch of MessageCollection::loadReviewInfo before any one of them completes and the result is cached for subsequent edits.
Other possibility is that someone would have been spamming Special:Translate that does not use a cache, or someone would have been explicitly purging the translation stats cache for the page (possibly many times).
Looking at logstash, I see these queries from a handful of IPs and some without IP/name (JobQueue I assume). It doesn't have stacktraces or request URLs in those slowquery log entries, but the DBPerformance has and it shows a lot of direct page views or renderings of the page through API. Querying for a particular ip I see A LOT of varnishslowlog entries... it seems possible that some kind of bot (or even multiple of them) have been hitting that page hard.
Anyway, Jaime has correctly identified that the query is not using a index. translate_reviews has an index on (trr_user, trr_page, trr_revision) but that particular query is only specifying trr_page and trr_revision. Doing a grep inside Translate, I only found one other place where this table is accessed: ApiTranslationReview, which does an insert with all values provided. This leads me to think that we should swap the order of columns in the unique index to (trr_page, trr_revision, trr_user) to make it usable for the query in MessageCollection.
You responded shortly before I left for vacation, and it seems people have been waiting for me to come back to fix this. I will start working on this by adding the missing index and exploring the other measures discussed.
I consider that sufficient work has been done to close this task. I propose the follow-up work to avoid stampedes to be done in a separate task (T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear seems suitable) and to be scheduled independently.