
Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage
Closed, Resolved · Public · BUG REPORT

Description

I am not sure whether this relates to T243548 or T242501, but yesterday and today I have kept running into time-out errors or pages failing to load on all Wikimedia websites. It is an intermittent issue, but more than half of the time the pages don't load.

Event Timeline

JEumerus changed the subtype of this task from "Task" to "Bug Report".
JEumerus added a project: Performance Issue.

For about two hours I have also been getting timeouts on the API and the website. (Location: Berlin)
The Grafana dashboards also show reduced edit rates, reduced API requests, and HTTP errors.

T242501 is most probably unrelated.

This was a major outage affecting most of Europe (e.g. problems in the esams / Amsterdam datacenter). Now it finally seems to be over.

I am from the Czech Republic and I was unable to open any Wikimedia page (Wikipedia, Phabricator, etc.) on desktop for more than an hour (17:30–18:30 UTC), but the outage itself most probably lasted much longer (according to reports on the downdetector.com website).

Probably worth a Wikimedia-Incident label. Not UBN!, because the site is back now.

If you are still experiencing connectivity issues, please let us know, and when you do, please also report:

  • The IP address to which you resolve en.wikipedia.org
  • The output of a TCP traceroute to port 443 of en.wikipedia.org

On Linux computers you should be able to do traceroute --tcp --port 443 en.wikipedia.org or mtr --report-wide --aslookup --tcp --port 443 en.wikipedia.org

On Windows computers you can open a command prompt and do tracert en.wikipedia.org

Thanks!
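For anyone reading the traceroutes posted in this task, hops where all three probes print "* * *" are the ones that returned no replies. A small helper can flag those hops in Linux-style traceroute output; this is an illustrative Python sketch, not part of any Wikimedia tooling:

```python
# Illustrative helper (not Wikimedia tooling): find hops in Linux-style
# `traceroute` output where all three probes timed out ("* * *").
def timed_out_hops(trace: str) -> list[int]:
    """Return hop numbers whose three probes all timed out."""
    hops = []
    for line in trace.splitlines():
        parts = line.split()
        # A hop line starts with the hop number; "* * *" means no replies.
        if parts and parts[0].isdigit() and parts[1:4] == ["*", "*", "*"]:
            hops.append(int(parts[0]))
    return hops

# Hypothetical sample modeled on the traces in this task.
sample = """ 1  192.168.1.100 (192.168.1.100)  1.641 ms  6.580 ms  6.580 ms
 2  * * *
 3  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  40.610 ms"""
print(timed_out_hops(sample))  # prints [2, 3]
```

Note that intermediate hops timing out is common (many routers drop or deprioritize probe packets); it is only the final hops failing that indicates a real reachability problem.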

This comment was removed by Piastu.

Łódź, Poland
Got some "504 Gateway Time-out" errors for a while (on pl.wikipedia.org).
After that, traceroute looks like this:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.641 ms  6.580 ms  6.580 ms
 2  * * *
 3  * * *
 4  * * *
 5  * 89.228.6.6 (89.228.6.6)  15.771 ms *
 6  * 89.228.6.1 (89.228.6.1)  12.570 ms *
 7  * * *
 8  * * *
 9  * * *
10  162.158.101.22 (162.158.101.22)  16.883 ms  16.849 ms 162.158.101.80 (162.158.101.80)  17.655 ms
11  162.158.101.22 (162.158.101.22)  17.292 ms  17.005 ms 162.158.101.80 (162.158.101.80)  17.090 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  40.610 ms  40.185 ms  45.367 ms

and one more:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.407 ms  7.217 ms  7.213 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  89.228.6.1 (89.228.6.1)  19.220 ms * *
 7  * * *
 8  * * et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177)  12.536 ms
 9  * * *
10  162.158.101.22 (162.158.101.22)  15.863 ms  17.780 ms 162.158.101.80 (162.158.101.80)  17.682 ms
11  162.158.101.80 (162.158.101.80)  17.832 ms  17.923 ms  18.011 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  34.336 ms  55.508 ms  51.682 ms

and the last one:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.629 ms  5.759 ms  5.765 ms
 2  * * *
 3  * * 10.0.2.5 (10.0.2.5)  15.424 ms
 4  * * *
 5  * * *
 6  89.228.6.1 (89.228.6.1)  26.905 ms *  121.946 ms
 7  host-89-228-4-5.dynamic.mm.pl (89.228.4.5)  122.545 ms * *
 8  et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177)  10.686 ms * *
 9  dialup-212.162.18.186.frankfurt1.eu.level3.net (213.242.117.186)  43.082 ms * *
10  162.158.101.80 (162.158.101.80)  19.219 ms  19.046 ms  19.002 ms
11  162.158.101.22 (162.158.101.22)  19.090 ms  19.013 ms 162.158.101.80 (162.158.101.80)  19.310 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  43.560 ms  44.304 ms  44.279 ms

Seems to work, but not very stable.
enwiki resolves to 91.198.174.192

I'm experiencing significant slowness (though it eventually works) connecting to 198.35.26.96 (dyna.wikimedia.org). [On a cellphone; can't easily get a traceroute.]

QEDK triaged this task as Unbreak Now! priority. Edited Jan 26 2020, 8:06 PM
QEDK added a subscriber: QEDK.

Got an error page (Error 504) once and now site does not load at all.
My traceroute is completely empty:

traceroute to dyna.wikimedia.org (103.102.166.224), 64 hops max, 64 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *

Update: The site now takes approximately a minute to load each page; occasionally the page also arrives garbled, making the finally-opened page unreadable.
Update 2: Got another error page as well:

Request from X via cp5008.eqsin.wmnet, ATS/8.0.5
Error: 504, Connection Timed Out at 2020-01-26 20:11:15 GMT

And I've got one more tracert from someone else (Warsaw, Poland):

Tracing route to www.wikipedia.pl [94.23.242.48]
over a maximum of 30 hops:
1 3 ms 2 ms 2 ms 192.168.0.1
2 12 ms 11 ms 12 ms 10.248.0.1
3 12 ms 10 ms 13 ms 172.20.253.1
4 20 ms 17 ms 17 ms 172.17.0.10
5 29 ms 21 ms 17 ms 172.17.28.186
6 13 ms 14 ms 13 ms 172.17.28.186
7 * * * Request timed out.
8 * * * Request timed out.
9 43 ms 42 ms 46 ms fra-fr5-sbb2-nc5.de.eu [54.36.50.116]
10 46 ms 90 ms 44 ms be103.rbx-g2-nc5.fr.eu [94.23.122.240]
11 * * * Request timed out.
12 * * * Request timed out.
13 * * * Request timed out.
14 50 ms 53 ms 49 ms tools.wikimedia.pl [94.23.242.48]

Adding a trace (London, UK)

01/26/20 20:13:36 Slow traceroute 91.198.174.192
Trace 91.198.174.192 ...
[[REDACTED]]   RTT:   1ms TTL: 64 [[REDACTED]]
62.3.80.17      RTT:  10ms TTL:254 (losubs.subs.bng1.th-lon.zen.net.uk ok)
62.3.80.38      RTT:  11ms TTL:253 (ae1-177.cr2.th-lon.zen.net.uk ok)
5.57.81.75      RTT:  11ms TTL:248 (lonap.as13335.net probable bogus rDNS: No DNS)
141.101.70.122  RTT:  10ms TTL: 56 (No rDNS)
141.101.70.122  RTT:  10ms TTL: 55 (No rDNS)
* * * failed
91.198.174.192  RTT:  19ms TTL: 52 (text-lb.esams.wikimedia.org ok)
Tracing route to dyna.wikimedia.org [198.35.26.96]
over a maximum of 30 hops:

  1    <1 ms    <1 ms     1 ms  192.168.0.1
  2    13 ms    10 ms    15 ms  147.160.40.2
  3    14 ms    14 ms    11 ms  100.126.0.17
  4    14 ms    10 ms    12 ms  100.126.0.1
  5    11 ms    16 ms    13 ms  ae3.core.pgrg.citywest.ca [24.244.66.205]
  6    28 ms    29 ms    27 ms  city-west-cable-and-telephone-corp.10gigabitethernet6-9.core1.yyc1.he.net [216.218.200.74]
  7    30 ms    29 ms    27 ms  10ge6-9.core1.yyc1.he.net [216.218.200.73]
  8    38 ms    39 ms    37 ms  100ge14-2.core1.yvr1.he.net [184.105.64.113]
  9    40 ms    40 ms    42 ms  100ge10-2.core1.sea1.he.net [184.105.64.109]
 10    44 ms    47 ms    43 ms  100ge15-1.core1.pdx1.he.net [184.105.64.138]
 11    61 ms    60 ms    58 ms  100ge5-2.core1.pao1.he.net [184.104.193.25]
 12    60 ms    77 ms    61 ms  xe-0-1-1.cr4-ulsfo.wikimedia.org [198.32.176.214]
 13    61 ms    60 ms    64 ms  text-lb.ulsfo.wikimedia.org [198.35.26.96]

Trace complete.

From Birmingham, England just now:

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     3 ms     3 ms     5 ms  BrightBox.ee [192.168.1.1]
2     8 ms     8 ms     9 ms  172.16.12.155
3     *        *       13 ms  213.121.98.129
4    13 ms    16 ms    16 ms  213.121.98.128
5    18 ms    14 ms    14 ms  87.237.20.138
6    38 ms    42 ms    44 ms  linx-lon1.as13335.net [195.66.225.179]
7    15 ms    14 ms    14 ms  141.101.70.123
8    15 ms    14 ms    17 ms  141.101.70.123
9     *        *        *     Request timed out.

10 22 ms 25 ms 21 ms text-lb.esams.wikimedia.org [91.198.174.192]

Trace complete.

Stopped mine manually. Looks like it is down here too (British Columbia, if it matters).

traceroute en.wikipedia.org
traceroute to dyna.wikimedia.org (198.35.26.96), 64 hops max, 52 byte packets
 1  192.168.1.254 (192.168.1.254)  5.488 ms  5.123 ms  1.664 ms
 2  10.31.128.1 (10.31.128.1)  1091.086 ms  910.696 ms  982.800 ms
 3  154.11.10.159 (154.11.10.159)  1120.761 ms  686.399 ms  999.062 ms
 4  64.125.15.12 (64.125.15.12)  1003.838 ms  63.769 ms  1001.126 ms
 5  ae7.cs2.sea1.us.zip.zayo.com (64.125.26.4)  999.608 ms  133.960 ms  908.272 ms
 6  ae3.cs2.sjc2.us.eth.zayo.com (64.125.29.40)  940.233 ms  980.503 ms  1124.357 ms
 7  ae27.cr2.sjc2.us.zip.zayo.com (64.125.30.233)  151.749 ms  36.743 ms  139.132 ms
 8  ae11.mpr4.sfo3.us.zip.zayo.com (64.125.24.226)  479.827 ms  938.800 ms  1046.789 ms
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
31  * * *
32  * * *
33  * * *
34  * * *
35  * * *
36  * * *
37  * * *
38  * * *
39  * * *
40  * * *
41  * * *
42  * * *
43  * * *
44  * * *
45  * * *
46  * * *
47  * * *
48  * * *
49  * * *
50  * * *

From Rochester, UK right now.

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     3 ms     3 ms     3 ms  [REDACTED] [[REDACTED]]
2    18 ms    18 ms    18 ms  172.16.10.162
3     *        *        *     Request timed out.
4    21 ms    37 ms    20 ms  31.55.185.188
5    58 ms    21 ms    20 ms  core2-hu0-15-0-6.c.[REDACTED].net [213.121.192.32]
6    30 ms    20 ms    20 ms  core2-hu0-7-0-0.[REDACTED].net [194.72.16.128]
7    92 ms    84 ms    69 ms  109.159.253.95
8    21 ms    20 ms    21 ms  141.101.70.141
9    21 ms    20 ms    20 ms  141.101.70.141

10 * * * Request timed out.
11 28 ms 28 ms 29 ms text-lb.esams.wikimedia.org [91.198.174.192]

Trace complete.

Jdforrester-WMF renamed this task from "Time-out error" to "Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage". Jan 26 2020, 11:05 PM

This seems to have been made worse by T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed. The patches for that Wikibase/Babel issue are already merged into master and should be deployed with the next train scheduled for 2020-02-05(?).
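The shape of that mitigation (a hard client-side timeout on the cross-wiki API call, degrading gracefully instead of tying up request workers) can be sketched in Python. The names here are illustrative only; the actual patches touch the Babel PHP extension:

```python
# Hypothetical sketch of the mitigation pattern: bound a slow remote API
# call with a short timeout so callers degrade gracefully rather than
# blocking for the full 10 seconds. Not the actual Babel (PHP) code.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def fetch_with_timeout(call, timeout_s=2.0, fallback=None):
    """Run `call`, but give up after `timeout_s` seconds and return `fallback`."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(call).result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread may still be running, but the caller is
        # no longer blocked on it.
        return fallback
    finally:
        pool.shutdown(wait=False)

# A fast call returns its result; a slow one hits the timeout and degrades.
fast = fetch_with_timeout(lambda: "data", timeout_s=2.0, fallback="unavailable")
slow = fetch_with_timeout(lambda: (time.sleep(1), "data")[1], timeout_s=0.2,
                          fallback="unavailable")
```

The trade-off is that under load a shorter timeout sheds work early, at the cost of occasionally missing data from a remote wiki that would have answered eventually.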

I think the priority of this task could be lowered now, as the immediate flood of outages seems to be over.

jcrespo lowered the priority of this task from Unbreak Now! to High. Jan 27 2020, 11:24 AM

This seems to have been made worse by T243725.

I may be misunderstanding you, but do you mean there is some pre-existing issue relating to this outage separate from T243725? If so, what is that pre-existing issue? Both T243725 and this incident report were filed around the same time. My understanding is that T243725 is the pre-existing issue (I didn't see a recent change to its logic), and some unknown factor yesterday made it worse by exposing it much more commonly than before (possibly some kind of cache invalidation).

Is that your reading as well, or something different?

(Just now, from Birmingham, England)

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     8 ms     3 ms     6 ms  BrightBox.ee [192.168.1.1]
2    13 ms    17 ms    10 ms  172.16.12.155
3     *        *       15 ms  213.121.98.129
4    12 ms    16 ms    13 ms  213.121.98.128
5    30 ms    15 ms    23 ms  87.237.20.138
6    35 ms    16 ms    45 ms  linx-lon1.as13335.net [195.66.225.179]
7    23 ms    17 ms    14 ms  141.101.70.127
8    24 ms    28 ms    15 ms  141.101.70.127
9     *        *        *     Request timed out.

10 * * * Request timed out.
11 * * * Request timed out.
12 * * * Request timed out.
13 * * * Request timed out.
14 * * * Request timed out.
15 * * * Request timed out.
16 * * * Request timed out.
17 * * * Request timed out.
18 * * * Request timed out.
19 * * * Request timed out.
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * *

This seems to have been made worse by T243725.

I may be misunderstanding you, but do you mean there is some pre-existing issue relating to this outage separate from T243725? If so, what is that pre-existing issue? Both T243725 and this incident report were filed around the same time. My understanding is that T243725 is the pre-existing issue (I didn't see a recent change to its logic), and some unknown factor yesterday made it worse by exposing it much more commonly than before (possibly some kind of cache invalidation).

I believe you are both on the same page; I think that comment just meant that this overload outage was made worse by T243725 existing.

Change 567366 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Babel@master] Lower timeout for API reqs from Babel from 10s to 1s

https://gerrit.wikimedia.org/r/567366

Change 567366 abandoned by Krinkle:
Lower timeout for API reqs from Babel from 10s to 1s

https://gerrit.wikimedia.org/r/567366

Change 570635 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 3 seconds.

https://gerrit.wikimedia.org/r/570635

Change 570635 merged by jenkins-bot:
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570635

Change 570663 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570663

Change 570664 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570664

Change 570664 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570664

Change 570663 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570663

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:28:40Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:29:57Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)

Addshore claimed this task.

I'm going to go ahead and close this, as the timeout decrease has been deployed and the patch to remove the API call altogether will be deployed next week.
Also, Babel itself probably didn't cause the outage; it was only one of the more visible symptoms of the API stress.