Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage
Closed, Resolved · Public · Bug Report

Description

I am not sure if this relates to T243548 or T242501 but yesterday and today I keep running into time-out errors or pages failing to load on all Wikimedia websites. It's an intermittent issue but over half of the time the pages don't load.

Details

Subject	Repo	Branch	Lines +/-
Timeout for meta api call from 10 to 2 seconds.	mediawiki/extensions/Babel	wmf/1.35.0-wmf.18	+1 -1
Timeout for meta api call from 10 to 2 seconds.	mediawiki/extensions/Babel	wmf/1.35.0-wmf.16	+1 -1
Timeout for meta api call from 10 to 2 seconds.	mediawiki/extensions/Babel	master	+1 -1
Lower timeout for API reqs from Babel from 10s to 1s	mediawiki/extensions/Babel	master	+1 -1


Related Objects

Mentioned In: T243915: Performance audit of Wikibase/Wikidata, Jan 2020
T243729: Check Wikibase services for expensive setup patterns (db & api calls etc) & fix / remove
T243727: Internal API HTTP requests should hit the app server internal service IP, not the public HTTPS URL
T243726: Babel should get cross-wiki languages via DB instead of making an HTTP request
T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed
Mentioned Here: T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed
T242501: Repeated 503 Errors
T243548: Elevated response times and CPU usage after deploy of 1.35.0-wmf.16 to all wikis

Event Timeline


Restricted Application added a subscriber: Aklapper. Jan 26 2020, 4:27 PM

JEumerus updated the task description. Jan 26 2020, 4:31 PM

JEumerus changed the subtype of this task from "Task" to "Bug Report".

JEumerus added a project: Performance Issue.

Piastu subscribed. Jan 26 2020, 4:35 PM

For about two hours I have also been getting timeouts on the API and on web pages (location: Berlin).
The Grafana boards also show low edit rates, low API request rates, and HTTP errors.

Trialpears subscribed. Jan 26 2020, 4:49 PM

Reedy added a project: Wikimedia-production-error. Jan 26 2020, 4:53 PM

RhinosF1 subscribed. Jan 26 2020, 5:20 PM

ZioNicco subscribed. Jan 26 2020, 6:16 PM

T242501 is most probably unrelated.

This was a major outage affecting most of Europe (e.g. problems in the esams / Amsterdam datacenter). It now finally seems to be over.

I am from the Czech Republic and was unable to open any Wikimedia page (Wikipedia, Phabricator, etc.) on desktop for more than an hour (17:30 - 18:30 UTC), but the outage itself most probably lasted much longer (according to reports on the downdetector.com website).

Probably worth a Wikimedia-Incident label. Not UBN!, because the site is back now.

Vachovec1 edited projects, added SRE, Traffic, Wikimedia-Incident; removed Wikimedia-production-error. Jan 26 2020, 7:16 PM

If you are still currently experiencing connectivity issues, please let us know, and when you do, please also report:

The IP address to which you resolve en.wikipedia.org
The output of a TCP traceroute to port 443 of en.wikipedia.org

On Linux computers you should be able to do traceroute --tcp --port 443 en.wikipedia.org or mtr --report-wide --aslookup --tcp --port 443 en.wikipedia.org

On Windows computers you can open a command prompt and do tracert en.wikipedia.org

Thanks!
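The two items requested above (the resolved IP address and a traceroute towards port 443) can be collected with a small helper. This is only a sketch: the function names `resolve_ipv4` and `traceroute_command` are made up for illustration, and anything that is not Windows is assumed to have the Linux `traceroute` with the `--tcp` and `--port` options quoted above.

```python
import platform
import socket
from typing import List, Optional


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the distinct IPv4 addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


def traceroute_command(hostname: str, system: Optional[str] = None) -> List[str]:
    """Build the traceroute command suggested in the comment above."""
    system = system or platform.system()
    if system == "Windows":
        # tracert has no TCP mode, so this is a plain ICMP trace.
        return ["tracert", hostname]
    # Assumption: every other platform is treated as Linux.
    return ["traceroute", "--tcp", "--port", "443", hostname]
```

Calling `resolve_ipv4("en.wikipedia.org")` needs working DNS; the list returned by `traceroute_command` can be passed to `subprocess.run`. On Linux, TCP probes usually require root privileges.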

Dvorapa subscribed. Jan 26 2020, 7:47 PM

Piastu added a comment. Jan 26 2020, 8:00 PM

This comment was removed by Piastu.

Łódź, Poland
Some "504 Gateway Time-out" for a while (on pl.wikipedia.org)
After that tracert looks like this:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.641 ms  6.580 ms  6.580 ms
 2  * * *
 3  * * *
 4  * * *
 5  * 89.228.6.6 (89.228.6.6)  15.771 ms *
 6  * 89.228.6.1 (89.228.6.1)  12.570 ms *
 7  * * *
 8  * * *
 9  * * *
10  162.158.101.22 (162.158.101.22)  16.883 ms  16.849 ms 162.158.101.80 (162.158.101.80)  17.655 ms
11  162.158.101.22 (162.158.101.22)  17.292 ms  17.005 ms 162.158.101.80 (162.158.101.80)  17.090 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  40.610 ms  40.185 ms  45.367 ms

and one more:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.407 ms  7.217 ms  7.213 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  89.228.6.1 (89.228.6.1)  19.220 ms * *
 7  * * *
 8  * * et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177)  12.536 ms
 9  * * *
10  162.158.101.22 (162.158.101.22)  15.863 ms  17.780 ms 162.158.101.80 (162.158.101.80)  17.682 ms
11  162.158.101.80 (162.158.101.80)  17.832 ms  17.923 ms  18.011 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  34.336 ms  55.508 ms  51.682 ms

and the last one:

traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets
 1  192.168.1.100 (192.168.1.100)  1.629 ms  5.759 ms  5.765 ms
 2  * * *
 3  * * 10.0.2.5 (10.0.2.5)  15.424 ms
 4  * * *
 5  * * *
 6  89.228.6.1 (89.228.6.1)  26.905 ms *  121.946 ms
 7  host-89-228-4-5.dynamic.mm.pl (89.228.4.5)  122.545 ms * *
 8  et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177)  10.686 ms * *
 9  dialup-212.162.18.186.frankfurt1.eu.level3.net (213.242.117.186)  43.082 ms * *
10  162.158.101.80 (162.158.101.80)  19.219 ms  19.046 ms  19.002 ms
11  162.158.101.22 (162.158.101.22)  19.090 ms  19.013 ms 162.158.101.80 (162.158.101.80)  19.310 ms
12  * * *
13  text-lb.esams.wikimedia.org (91.198.174.192)  43.560 ms  44.304 ms  44.279 ms

Seems to work, but not very stable.
enwiki resolves to 91.198.174.192

I'm experiencing significant slowness (but it eventually works) connecting to 198.35.26.96 (dyna.wikimedia.org) [on a cellphone, can't easily get a traceroute]

TheSandDoctor subscribed. Jan 26 2020, 8:03 PM

Got an error page (Error 504) once and now site does not load at all.
My traceroute is completely empty:

traceroute to dyna.wikimedia.org (103.102.166.224), 64 hops max, 64 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *

Update: The site takes approximately a minute to load each page; occasionally the page is also garbled, which makes it unreadable once it finally opens.
Update 2: Got another error page as well:

Request from X via cp5008.eqsin.wmnet, ATS/8.0.5
Error: 504, Connection Timed Out at 2020-01-26 20:11:15 GMT

Restricted Application added a subscriber: Liuxinyu970226. Jan 26 2020, 8:06 PM

DannyS712 subscribed. Jan 26 2020, 8:07 PM

taavi subscribed. Jan 26 2020, 8:08 PM

Piastu added a comment. Jan 26 2020, 8:13 PM

This comment was removed by Piastu.

And I've got one more tracert from someone else (Warsaw, Poland):

Tracing route to www.wikipedia.pl [94.23.242.48]
over a maximum of 30 hops:
1 3 ms 2 ms 2 ms 192.168.0.1
2 12 ms 11 ms 12 ms 10.248.0.1
3 12 ms 10 ms 13 ms 172.20.253.1
4 20 ms 17 ms 17 ms 172.17.0.10
5 29 ms 21 ms 17 ms 172.17.28.186
6 13 ms 14 ms 13 ms 172.17.28.186
7 * * * Request timed out.
8 * * * Request timed out.
9 43 ms 42 ms 46 ms fra-fr5-sbb2-nc5.de.eu [54.36.50.116]
10 46 ms 90 ms 44 ms be103.rbx-g2-nc5.fr.eu [94.23.122.240]
11 * * * Request timed out.
12 * * * Request timed out.
13 * * * Request timed out.
14 50 ms 53 ms 49 ms tools.wikimedia.pl [94.23.242.48]

Adding a trace (London, UK)

01/26/20 20:13:36 Slow traceroute 91.198.174.192
Trace 91.198.174.192 ...
[[REDACTED]]   RTT:   1ms TTL: 64 [[REDACTED]]
62.3.80.17      RTT:  10ms TTL:254 (losubs.subs.bng1.th-lon.zen.net.uk ok)
62.3.80.38      RTT:  11ms TTL:253 (ae1-177.cr2.th-lon.zen.net.uk ok)
5.57.81.75      RTT:  11ms TTL:248 (lonap.as13335.net probable bogus rDNS: No DNS)
141.101.70.122  RTT:  10ms TTL: 56 (No rDNS)
141.101.70.122  RTT:  10ms TTL: 55 (No rDNS)
* * * failed
91.198.174.192  RTT:  19ms TTL: 52 (text-lb.esams.wikimedia.org ok)

Tracing route to dyna.wikimedia.org [198.35.26.96]
over a maximum of 30 hops:

  1    <1 ms    <1 ms     1 ms  192.168.0.1
  2    13 ms    10 ms    15 ms  147.160.40.2
  3    14 ms    14 ms    11 ms  100.126.0.17
  4    14 ms    10 ms    12 ms  100.126.0.1
  5    11 ms    16 ms    13 ms  ae3.core.pgrg.citywest.ca [24.244.66.205]
  6    28 ms    29 ms    27 ms  city-west-cable-and-telephone-corp.10gigabitethernet6-9.core1.yyc1.he.net [216.218.200.74]
  7    30 ms    29 ms    27 ms  10ge6-9.core1.yyc1.he.net [216.218.200.73]
  8    38 ms    39 ms    37 ms  100ge14-2.core1.yvr1.he.net [184.105.64.113]
  9    40 ms    40 ms    42 ms  100ge10-2.core1.sea1.he.net [184.105.64.109]
 10    44 ms    47 ms    43 ms  100ge15-1.core1.pdx1.he.net [184.105.64.138]
 11    61 ms    60 ms    58 ms  100ge5-2.core1.pao1.he.net [184.104.193.25]
 12    60 ms    77 ms    61 ms  xe-0-1-1.cr4-ulsfo.wikimedia.org [198.32.176.214]
 13    61 ms    60 ms    64 ms  text-lb.ulsfo.wikimedia.org [198.35.26.96]

Trace complete.

Amorymeltzer subscribed. Jan 26 2020, 8:25 PM

mdaniels5757 subscribed. Jan 26 2020, 8:27 PM

From Birmingham, England just now:

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     3 ms     3 ms     5 ms  BrightBox.ee [192.168.1.1]
2     8 ms     8 ms     9 ms  172.16.12.155
3     *        *       13 ms  213.121.98.129
4    13 ms    16 ms    16 ms  213.121.98.128
5    18 ms    14 ms    14 ms  87.237.20.138
6    38 ms    42 ms    44 ms  linx-lon1.as13335.net [195.66.225.179]
7    15 ms    14 ms    14 ms  141.101.70.123
8    15 ms    14 ms    17 ms  141.101.70.123
9     *        *        *     Request timed out.
10   22 ms    25 ms    21 ms  text-lb.esams.wikimedia.org [91.198.174.192]

Trace complete.

Masumrezarock100 subscribed. Jan 26 2020, 8:29 PM

Davey2010 subscribed. Jan 26 2020, 8:29 PM

Stopped mine manually. Looks like it is down here too. British Columbia if it matters.

traceroute en.wikipedia.org
traceroute to dyna.wikimedia.org (198.35.26.96), 64 hops max, 52 byte packets
 1  192.168.1.254 (192.168.1.254)  5.488 ms  5.123 ms  1.664 ms
 2  10.31.128.1 (10.31.128.1)  1091.086 ms  910.696 ms  982.800 ms
 3  154.11.10.159 (154.11.10.159)  1120.761 ms  686.399 ms  999.062 ms
 4  64.125.15.12 (64.125.15.12)  1003.838 ms  63.769 ms  1001.126 ms
 5  ae7.cs2.sea1.us.zip.zayo.com (64.125.26.4)  999.608 ms  133.960 ms  908.272 ms
 6  ae3.cs2.sjc2.us.eth.zayo.com (64.125.29.40)  940.233 ms  980.503 ms  1124.357 ms
 7  ae27.cr2.sjc2.us.zip.zayo.com (64.125.30.233)  151.749 ms  36.743 ms  139.132 ms
 8  ae11.mpr4.sfo3.us.zip.zayo.com (64.125.24.226)  479.827 ms  938.800 ms  1046.789 ms
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
31  * * *
32  * * *
33  * * *
34  * * *
35  * * *
36  * * *
37  * * *
38  * * *
39  * * *
40  * * *
41  * * *
42  * * *
43  * * *
44  * * *
45  * * *
46  * * *
47  * * *
48  * * *
49  * * *
50  * * *

Agusbou2015 subscribed. Jan 26 2020, 8:33 PM

Norway: https://www.irccloud.com/pastebin/kGcHnxXE
US EAST COAST: https://imgur.com/oljPECR

From Rochester, UK right now.

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     3 ms     3 ms     3 ms  [REDACTED] [[REDACTED]]
2    18 ms    18 ms    18 ms  172.16.10.162
3     *        *        *     Request timed out.
4    21 ms    37 ms    20 ms  31.55.185.188
5    58 ms    21 ms    20 ms  core2-hu0-15-0-6.c.[REDACTED].net [213.121.192.32]
6    30 ms    20 ms    20 ms  core2-hu0-7-0-0.[REDACTED].net [194.72.16.128]
7    92 ms    84 ms    69 ms  109.159.253.95
8    21 ms    20 ms    21 ms  141.101.70.141
9    21 ms    20 ms    20 ms  141.101.70.141
10    *        *        *     Request timed out.
11   28 ms    28 ms    29 ms  text-lb.esams.wikimedia.org [91.198.174.192]

Trace complete.

Paladox subscribed. Jan 26 2020, 8:51 PM

Xaosflux subscribed. Jan 26 2020, 8:55 PM

Thryduulf subscribed. Jan 26 2020, 9:16 PM

BethNaught subscribed. Jan 26 2020, 9:22 PM

Addshore subscribed. Jan 26 2020, 9:33 PM

Jdforrester-WMF renamed this task from Time-out error to Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage. Jan 26 2020, 11:05 PM

mdaniels5757 unsubscribed. Jan 26 2020, 11:12 PM

mdaniels5757 subscribed.

Addshore mentioned this in T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed. Jan 27 2020, 10:51 AM

Addshore mentioned this in T243726: Babel should get cross-wiki languages via DB instead of making an HTTP request.

Addshore mentioned this in T243727: Internal API HTTP requests should hit the app server internal service IP, not the public HTTPS URL.

Addshore mentioned this in T243729: Check Wikibase services for expensive setup patterns (db & api calls etc) & fix / remove.

Count_Count subscribed. Jan 27 2020, 10:57 AM

This seems to have been made worse by T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed. The patches for that Wikibase/Babel issue are already merged into master and should be deployed with the next train scheduled for 2020-02-05(?).

I think the priority of this task could be lowered now, as the immediate flood of outages seems to be over.

jcrespo lowered the priority of this task from Unbreak Now! to High. Jan 27 2020, 11:24 AM

In T243713#5833717, @Michael wrote:

This seems to have been made worse by T243725.

I may be misunderstanding you, but do you mean there is some pre-existing issue relating to this outage separate from T243725? If so, what is that pre-existing issue? Both T243725 and this incident report were filed around the same time. My understanding is that T243725 is the pre-existing issue (I didn't see a recent change to its logic), and some unknown factor yesterday made it worse by exposing it much more commonly than before (possibly some kind of cache invalidation).

Again, or something different?

(Just now, from Birmingham, England)

Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:

1     8 ms     3 ms     6 ms  BrightBox.ee [192.168.1.1]
2    13 ms    17 ms    10 ms  172.16.12.155
3     *        *       15 ms  213.121.98.129
4    12 ms    16 ms    13 ms  213.121.98.128
5    30 ms    15 ms    23 ms  87.237.20.138
6    35 ms    16 ms    45 ms  linx-lon1.as13335.net [195.66.225.179]
7    23 ms    17 ms    14 ms  141.101.70.127
8    24 ms    28 ms    15 ms  141.101.70.127
9     *        *        *     Request timed out.
10    *        *        *     Request timed out.
11    *        *        *     Request timed out.
12    *        *        *     Request timed out.
13    *        *        *     Request timed out.
14    *        *        *     Request timed out.
15    *        *        *     Request timed out.
16    *        *        *     Request timed out.
17    *        *        *     Request timed out.
18    *        *        *     Request timed out.
19    *        *        *     Request timed out.
20    *        *        *     Request timed out.
21    *        *        *     Request timed out.
22    *        *        *     Request timed out.
23 * *

ZioNicco unsubscribed. Jan 28 2020, 8:49 AM

In T243713#5835243, @Krinkle wrote:

In T243713#5833717, @Michael wrote:

This seems to have been made worse by T243725.

I may be misunderstanding you, but do you mean there is some pre-existing issue relating to this outage separate from T243725? If so, what is that pre-existing issue? Both T243725 and this incident report were filed around the same time. My understanding is that T243725 is the pre-existing issue (I didn't see a recent change to its logic), and some unknown factor yesterday made it worse by exposing it much more commonly than before (possibly some kind of cache invalidation).

I believe you are both on the same page; I think that just meant that this overload outage was made worse by T243725 being a thing.

Ladsgroup mentioned this in T243915: Performance audit of Wikibase/Wikidata, Jan 2020. Jan 29 2020, 4:16 PM

Change 567366 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Babel@master] Lower timeout for API reqs from Babel from 10s to 1s

https://gerrit.wikimedia.org/r/567366

Change 567366 abandoned by Krinkle:
Lower timeout for API reqs from Babel from 10s to 1s

https://gerrit.wikimedia.org/r/567366

Maintenance_bot removed a project: Patch-For-Review. Jan 29 2020, 10:10 PM

Change 570635 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 3 seconds.

https://gerrit.wikimedia.org/r/570635

gerritbot added a project: Patch-For-Review. Feb 6 2020, 1:19 PM

Change 570635 merged by jenkins-bot:
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570635

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.19; 2020-02-11). Feb 6 2020, 2:00 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 6 2020, 2:10 PM

Change 570663 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570663

Change 570664 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570664

Change 570664 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570664

Change 570663 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.

https://gerrit.wikimedia.org/r/570663

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:28:40Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:29:57Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.18; 2020-02-04); removed MW-1.35-notes (1.35.0-wmf.19; 2020-02-11). Feb 6 2020, 8:01 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 6 2020, 8:11 PM

I'm gonna go ahead and close this, as a timeout decrease has been deployed and the patch to remove the API call altogether will be deployed next week.
Also Babel itself probably didn't cause the outage, but was only one of the more visible results of the API stress.
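A rough way to see why the shorter timeout relieves pressure is Little's law: the average number of workers stuck waiting on a call is roughly the rate of such calls multiplied by how long each one waits. The sketch below is only an illustration. The rate of 50 calls per second is a hypothetical figure, not a measurement from this incident, and it assumes the worst case where every call runs into the timeout.

```python
def blocked_workers(calls_per_second: float, timeout_seconds: float) -> float:
    """Little's law (L = rate * wait): average number of workers blocked
    on a call that always runs into the timeout."""
    return calls_per_second * timeout_seconds


# Hypothetical load, NOT taken from the incident data:
# 50 uncached Babel API calls per second, all of them timing out.
HYPOTHETICAL_RATE = 50.0

before = blocked_workers(HYPOTHETICAL_RATE, 10.0)  # old 10 s timeout
after = blocked_workers(HYPOTHETICAL_RATE, 2.0)    # new 2 s timeout

print(f"10 s timeout: ~{before:.0f} workers blocked")
print(f" 2 s timeout: ~{after:.0f} workers blocked")
```

Under these assumptions the number of blocked workers drops by a factor of five, in proportion to the timeout. This only limits the damage from a slow API; it does not remove the API call itself, which the follow-up patch addresses.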

Restricted Application added a project: User-Addshore. Feb 7 2020, 2:54 PM

JEumerus unsubscribed. Oct 26 2023, 12:34 PM
