Page MenuHomePhabricator

503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Hi, today i tryed loading https://meta.wikimedia.org/wiki/Privacy_policy which was stuck in loading state.

Wikipedia en was loading very slowly too.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Same in Russian Wikipedia (from Russia):

Request from […] via cp1054 cp1054, Varnish XID 382637319
Error: 503, Backend fetch failed at Tue, 22 May 2018 13:43:18 GMT

I guess this affects all Wikimedia projects right now.

@Xaosflux be aware that editing the comment doesn't erase the IP info. It's still visible in "View edit history", on the comment's contextual menu.

Paladox lowered the priority of this task from Unbreak Now! to High.May 22 2018, 1:59 PM

Things have recovered - the DBs errors, so far, look like a consequence and not the cause.
We are investigating

jcrespo subscribed.

The issue seems to point to

metawiki_overload.png (893×1 px, 147 KB)

Hits 	Tmax 	Tavg 	Tsum 	Hosts 	Users 	Schemas
33533	15	5	173,069	db1079, db1086, db1094	wikiuser	metawiki

That is the equivalent to 10-100 database hours of work in 5 minutes.

The query is very slow, normally from memory, it takes 0.5 seconds, but that is quite bad, and I can see it running with no index and scanning 200K rows.

EXPLAIN SELECT /* MessageCollection::loadReviewInfo <IP> */ page_namespace, page_title, trr_user FROM `page` JOIN `translate_reviews` ON ((page_id=trr_page) AND (page_latest=trr_revision)) WHERE (page_namespace = '1198' AND page_title IN ('Privacy_policy/Page_display_title/en', 'Privacy_policy/275/en', 'Privacy_policy/244/en', 'Privacy_policy/245/en', 'Privacy_policy/22/en', 'Privacy_policy/23/en', 'Privacy_policy/24/en', 'Privacy_policy/25/en', 'Privacy_policy/26/en', 'Privacy_policy/246/en', 'Privacy_policy/247/en', 'Privacy_policy/243/en', 'Privacy_policy/242/en', 'Privacy_policy/87/en', 'Privacy_policy/29/en', 'Privacy_policy/186/en', 'Privacy_policy/187/en', 'Privacy_policy/188/en', 'Privacy_policy/189/en', 'Privacy_policy/190/en', 'Privacy_policy/30/en', 'Privacy_policy/31/en', 'Privacy_policy/32/en', 'Privacy_policy/33/en', 'Privacy_policy/34/en', 'Privacy_policy/35/en', 'Privacy_policy/248/en', 'Privacy_policy/36/en', 'Privacy_policy/58/en', 'Privacy_policy/59/en', 'Privacy_policy/60/en', 'Privacy_policy/61/en', 'Privacy_policy/62/en', 'Privacy_policy/63/en', 'Privacy_policy/64/en', 'Privacy_policy/65/en', 'Privacy_policy/66/en', 'Privacy_policy/185/en', 'Privacy_policy/89/en', 'Privacy_policy/90/en', 'Privacy_policy/91/en', 'Privacy_policy/191/en', 'Privacy_policy/93/en', 'Privacy_policy/94/en', 'Privacy_policy/239/en', 'Privacy_policy/249/en', 'Privacy_policy/95/en', 'Privacy_policy/193/en', 'Privacy_policy/194/en', 'Privacy_policy/195/en', 'Privacy_policy/96/en', 'Privacy_policy/97/en', 'Privacy_policy/98/en', 'Privacy_policy/250/en', 'Privacy_policy/99/en', 'Privacy_policy/100/en', 'Privacy_policy/251/en', 'Privacy_policy/125/en', 'Privacy_policy/126/en', 'Privacy_policy/210/en', 'Privacy_policy/127/en', 'Privacy_policy/211/en', 'Privacy_policy/128/en', 'Privacy_policy/129/en', 'Privacy_policy/213/en', 'Privacy_policy/130/en', 'Privacy_policy/131/en', 'Privacy_policy/132/en', 'Privacy_policy/101/en', 'Privacy_policy/197/en', 'Privacy_policy/200/en', 'Privacy_policy/252/en', 'Privacy_policy/102/en', 'Privacy_policy/199/en', 'Privacy_policy/103/en', 'Privacy_policy/104/en', 'Privacy_policy/105/en', 'Privacy_policy/201/en', 'Privacy_policy/106/en', 'Privacy_policy/202/en', 'Privacy_policy/107/en', 'Privacy_policy/108/en', 'Privacy_policy/109/en', 'Privacy_policy/110/en', 'Privacy_policy/111/en', 'Privacy_policy/112/en', 'Privacy_policy/113/en', 'Privacy_policy/114/en', 'Privacy_policy/115/en', 'Privacy_policy/116/en', 'Privacy_policy/117/en', 'Privacy_policy/240/en', 'Privacy_policy/203/en', 'Privacy_policy/118/en', 'Privacy_policy/119/en', 'Privacy_policy/120/en', 'Privacy_policy/253/en', 'Privacy_policy/254/en', 'Privacy_policy/255/en', 'Privacy_policy/256/en', 'Privacy_policy/257/en', 'Privacy_policy/258/en', 'Privacy_policy/259/en', 'Privacy_policy/260/en', 'Privacy_policy/261/en', 'Privacy_policy/262/en', 'Privacy_policy/123/en', 'Privacy_policy/263/en', 'Privacy_policy/264/en', 'Privacy_policy/265/en', 'Privacy_policy/266/en', 'Privacy_policy/267/en', 'Privacy_policy/268/en', 'Privacy_policy/269/en', 'Privacy_policy/270/en', 'Privacy_policy/271/en', 'Privacy_policy/272/en', 'Privacy_policy/133/en', 'Privacy_policy/134/en', 'Privacy_policy/135/en', 'Privacy_policy/216/en', 'Privacy_policy/136/en', 'Privacy_policy/217/en', 'Privacy_policy/137/en', 'Privacy_policy/218/en', 'Privacy_policy/138/en', 'Privacy_policy/139/en', 'Privacy_policy/140/en', 'Privacy_policy/219/en', 'Privacy_policy/141/en', 'Privacy_policy/220/en', 'Privacy_policy/142/en', 'Privacy_policy/221/en', 'Privacy_policy/143/en', 'Privacy_policy/236/en', 'Privacy_policy/144/en', 'Privacy_policy/145/en', 'Privacy_policy/146/en', 'Privacy_policy/222/en', 'Privacy_policy/147/en', 'Privacy_policy/223/en', 'Privacy_policy/148/en', 'Privacy_policy/273/en', 'Privacy_policy/178/en', 'Privacy_policy/149/en', 'Privacy_policy/224/en', 'Privacy_policy/150/en', 'Privacy_policy/151/en', 'Privacy_policy/152/en', 'Privacy_policy/227/en', 'Privacy_policy/153/en', 'Privacy_policy/228/en', 'Privacy_policy/154/en', 'Privacy_policy/155/en', 'Privacy_policy/156/en', 'Privacy_policy/157/en', 'Privacy_policy/237/en', 'Privacy_policy/158/en', 'Privacy_policy/274/en', 'Privacy_policy/229/en', 'Privacy_policy/159/en', 'Privacy_policy/230/en', 'Privacy_policy/160/en', 'Privacy_policy/161/en', 'Privacy_policy/162/en', 'Privacy_policy/163/en', 'Privacy_policy/166/en', 'Privacy_policy/164/en', 'Privacy_policy/165/en', 'Privacy_policy/231/en', 'Privacy_policy/180/en', 'Privacy_policy/232/en', 'Privacy_policy/233/en', 'Privacy_policy/181/en', 'Privacy_policy/182/en', 'Privacy_policy/183/en', 'Privacy_policy/234/en', 'Privacy_policy/167/en', 'Privacy_policy/235/en', 'Privacy_policy/168/en', 'Privacy_policy/169/en', 'Privacy_policy/170/en', 'Privacy_policy/171/en', 'Privacy_policy/172/en', 'Privacy_policy/173/en', 'Privacy_policy/174/en', 'Privacy_policy/175/en', 'Privacy_policy/176/en', 'Privacy_policy/177/en') )\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: translate_reviews
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 12
          ref: NULL
         rows: 194094
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY,name_title
          key: PRIMARY
      key_len: 4
          ref: metawiki.translate_reviews.trr_page
         rows: 1
        Extra: Using where
2 rows in set (0.03 sec)

Finally leading to: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1526994927108&to=1526998716796

Many users in Arabic Wikipedia suffer from this problem also!

With the comment above mentioning metawiki and Privacy_policy I can't help but speculate that the edit at 13:28 or translation log entries at 13:26 and 13:32 might have had something to do with it...

https://meta.wikimedia.org/w/index.php?title=Privacy_policy&diff=18066512&oldid=18063747&diffmode=source

13:32, 22 May 2018 Kaganer (talk | contribs) changed the state of Russian translations of Privacy policy from In progress to Needs updating
13:28, 22 May 2018‎ Kaganer (talk | contribs)‎ m . . (52,758 bytes) (+3)‎ . . (thank) - https://meta.wikimedia.org/w/index.php?title=Privacy_policy&diff=18066512&oldid=18063747
13:26, 22 May 2018 FuzzyBot (talk | contribs) changed the state of Russian translations of Privacy policy from Needs updating to In progress

Nikerabbit subscribed.

Also:
https://meta.wikimedia.org/w/index.php?title=Special:Log&page=Privacy+policy&uselang=en

13:29, 22 May 2018 Kaganer (talk | contribs) marked Privacy policy for translation

It is not clear to me what could have caused such a high number of simultaneous queries for MessageCollection::loadReviewInfo, but it is easy to understand that if they start piling up it goes very fast downhill from there.

TranslationsUpdateJob does:

		MessageGroupStats::clearGroup( $id );
		MessageGroupStats::forGroup( $id );

I am not 100% sure how transactions work in the job queue, but if these two are not executed inside the same transaction, there could be a brief moment where stats are missing and many simultaneous page views to Privacy_policy could trigger a bunch of MessageCollection::loadReviewInfo before any one of them completes and the result is cached for subsequent edits.

Other possibility is that someone would have been spamming Special:Translate that does not use a cache, or someone would have been explicitly purging the translation stats cache for the page (possibly many times).

Looking at logstash, I see these queries from a handful of IPs and some without IP/name (JobQueue I assume). It doesn't have stacktraces or request URLs in those slowquery log entries, but the DBPerformance has and it shows a lot of direct page views or renderings of the page through API. Querying for a particular ip I see A LOT of varnishslowlog entries... it seems possible that some kind of bot (or even multiple of them) have been hitting that page hard.

Anyway, Jaime has correctly identified that the query is not using a index. translate_reviews has an index on (trr_user, trr_page, trr_revision) but that particular query is only specifying trr_page and trr_revision. Doing a grep inside Translate, I only found one other place where this table is accessed: ApiTranslationReview, which does an insert with all values provided. This leads me to think that we should swap the order of columns in the unique index to (trr_page, trr_revision, trr_user) to make it usable for the query in MessageCollection.

@jcrespo @Marostegui Do you agree with this conclusion?

Also, does anyone think that fixing the index is not sufficient, and that MessageGroupStats should use, for example, PoolCounter to limit the number of concurrent queries?

It appears to me in the desktop view put not in mobile version although in mobile version i just can explore pages but can't edit!

@1339861mzb Are you sure this is related? Please read https://www.mediawiki.org/wiki/How_to_report_a_bug and fill a new task. This seems to be unrelated.

@jcrespo @Marostegui Do you agree with this conclusion?

Any feedback, please?

I am ok with any index change you want, just create the ticket. Independently of that, you should make sure to avoid any cache stampede issues, maybe poolcounter could help with that?

Vvjjkkii renamed this task from 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) to iicaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
1339861mzb renamed this task from iicaaaaaaa to 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) .Jul 1 2018, 6:31 AM
1339861mzb updated the task description. (Show Details)

So full outage but no actionable yet?

So full outage but no actionable yet?

You responded shortly before I left for vacation, and it seems people have been waiting for me to come back to fix this. I will start working on this by adding the missing index and exploring the other measures discussed.

Hey, you had an excuse, the rest of the people didn't :-D :-P

Change 450032 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Optimize index on translate_reviews

https://gerrit.wikimedia.org/r/450032

While testing this index patch, I noticed I had locally created a custom index for translatewiki.net which I had forgotten. *scolds the past himself*

Change 450032 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Optimize index on translate_reviews

https://gerrit.wikimedia.org/r/450032

Nikerabbit changed the task status from Open to Stalled.Aug 30 2018, 9:51 AM

I consider that sufficient work has been done to close this task. I propose the follow-up work to avoid stampedes to be done in a separate task (T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear seems suitable) and to be scheduled independently.

Nikerabbit changed the task status from Stalled to Open.Sep 19 2018, 8:46 AM

to be scheduled independently.

This is cool to me, thank you for your work!

Pginer-WMF claimed this task.
Pginer-WMF subscribed.

Based on the above comments, it seems this can be closed

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM