Description
Details
Related Objects
- Mentioned In
- T243915: Performance audit of Wikibase/Wikidata, Jan 2020
T243729: Check Wikibase services for expensive setup patterns (db & api calls etc) & fix / remove
T243727: Internal API HTTP requests should hit the app server internal service IP, not the public HTTPS URL
T243726: Babel should get cross-wiki languages via DB instead of making an HTTP request
T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed - Mentioned Here
- T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed
T242501: Repeated 503 Errors
T243548: Elevated response times and CPU usage after deploy of 1.35.0-wmf.16 to all wikis
Event Timeline
Since about two hours I also have timeouts on API and webpage. (Location Berlin)
The Grafana boards also show low edit rates and API requests and HTTP Errors.
T242501 is most probably unrelated.
This was a major outage affecting most of Europe (e. g. problems in esams / Amsterdam datacenter). Now it seems to be finally over.
I am from Czech republic and I was unable to open any wikimedia page (wikipedia, phabricator etc.) on desktop for more than an hour (17:30 - 18:30 UTC), but the outage itself most probably lasted much longer (according to reports on downdetector.com website).
Probably worth a Wikimedia-Incident label. Not UBN!, because the site is back now.
If you are still currently experiencing connectivity issues, please let us know, and when you do, please also report:
- The IP address to which you resolve en.wikipedia.org
- The output of a TCP traceroute to port 443 of en.wikipedia.org
On Linux computers you should be able to do traceroute --tcp --port 443 en.wikipedia.org or mtr --report-wide --aslookup --tcp --port 443 en.wikipedia.org
On Windows computers you can open a command prompt and do tracert en.wikipedia.org
Thanks!
Łódź, Poland
Some "504 Gateway Time-out" for a while (on pl.wikipedia.org)
After that tracert looks like this:
traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets 1 192.168.1.100 (192.168.1.100) 1.641 ms 6.580 ms 6.580 ms 2 * * * 3 * * * 4 * * * 5 * 89.228.6.6 (89.228.6.6) 15.771 ms * 6 * 89.228.6.1 (89.228.6.1) 12.570 ms * 7 * * * 8 * * * 9 * * * 10 162.158.101.22 (162.158.101.22) 16.883 ms 16.849 ms 162.158.101.80 (162.158.101.80) 17.655 ms 11 162.158.101.22 (162.158.101.22) 17.292 ms 17.005 ms 162.158.101.80 (162.158.101.80) 17.090 ms 12 * * * 13 text-lb.esams.wikimedia.org (91.198.174.192) 40.610 ms 40.185 ms 45.367 ms
and one more:
traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets 1 192.168.1.100 (192.168.1.100) 1.407 ms 7.217 ms 7.213 ms 2 * * * 3 * * * 4 * * * 5 * * * 6 89.228.6.1 (89.228.6.1) 19.220 ms * * 7 * * * 8 * * et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177) 12.536 ms 9 * * * 10 162.158.101.22 (162.158.101.22) 15.863 ms 17.780 ms 162.158.101.80 (162.158.101.80) 17.682 ms 11 162.158.101.80 (162.158.101.80) 17.832 ms 17.923 ms 18.011 ms 12 * * * 13 text-lb.esams.wikimedia.org (91.198.174.192) 34.336 ms 55.508 ms 51.682 ms
and the last one:
traceroute to en.wikipedia.org (91.198.174.192), 30 hops max, 60 byte packets 1 192.168.1.100 (192.168.1.100) 1.629 ms 5.759 ms 5.765 ms 2 * * * 3 * * 10.0.2.5 (10.0.2.5) 15.424 ms 4 * * * 5 * * * 6 89.228.6.1 (89.228.6.1) 26.905 ms * 121.946 ms 7 host-89-228-4-5.dynamic.mm.pl (89.228.4.5) 122.545 ms * * 8 et-0-0-23.bar4.Warsaw1.Level3.net (213.242.118.177) 10.686 ms * * 9 dialup-212.162.18.186.frankfurt1.eu.level3.net (213.242.117.186) 43.082 ms * * 10 162.158.101.80 (162.158.101.80) 19.219 ms 19.046 ms 19.002 ms 11 162.158.101.22 (162.158.101.22) 19.090 ms 19.013 ms 162.158.101.80 (162.158.101.80) 19.310 ms 12 * * * 13 text-lb.esams.wikimedia.org (91.198.174.192) 43.560 ms 44.304 ms 44.279 ms
Seems to work, but not very stable.
enwiki resolves to 91.198.174.192
Im experiancing significant slowness (but it eventually works) connecting to 198.35.26.96 (dyna.wikimedia.org) [on a cellphone, cant easily get a traceroute]
Got an error page (Error 504) once and now site does not load at all.
My traceroute is completely empty:
traceroute to dyna.wikimedia.org (103.102.166.224), 64 hops max, 64 byte packets 1 * * * 2 * * * 3 * * * 4 * * * 5 * * * 6 * * * 7 * * * 8 * * * 9 * * * 10 * * * 11 * * * 12 * * * 13 * * * 14 * * * 15 * * * 16 * * * 17 * * *
Update: Site approximately takes a minute to load each page, occasionally, the site is also garbled making the finally-opened page unreadable.
Update 2: Got another error page as well:
Request from X via cp5008.eqsin.wmnet, ATS/8.0.5 Error: 504, Connection Timed Out at 2020-01-26 20:11:15 GMT
And i've got one more tracert from someone else — Warsaw, Poland:
Tracing route to www.wikipedia.pl [94.23.242.48] over a maximum of 30 hops: 1 3 ms 2 ms 2 ms 192.168.0.1 2 12 ms 11 ms 12 ms 10.248.0.1 3 12 ms 10 ms 13 ms 172.20.253.1 4 20 ms 17 ms 17 ms 172.17.0.10 5 29 ms 21 ms 17 ms 172.17.28.186 6 13 ms 14 ms 13 ms 172.17.28.186 7 * * * Request timed out. 8 * * * Request timed out. 9 43 ms 42 ms 46 ms fra-fr5-sbb2-nc5.de.eu [54.36.50.116] 10 46 ms 90 ms 44 ms be103.rbx-g2-nc5.fr.eu [94.23.122.240] 11 * * * Request timed out. 12 * * * Request timed out. 13 * * * Request timed out. 14 50 ms 53 ms 49 ms tools.wikimedia.pl [94.23.242.48]
Adding a trace (London, UK)
01/26/20 20:13:36 Slow traceroute 91.198.174.192 Trace 91.198.174.192 ... [[REDACTED]] RTT: 1ms TTL: 64 [[REDACTED]] 62.3.80.17 RTT: 10ms TTL:254 (losubs.subs.bng1.th-lon.zen.net.uk ok) 62.3.80.38 RTT: 11ms TTL:253 (ae1-177.cr2.th-lon.zen.net.uk ok) 5.57.81.75 RTT: 11ms TTL:248 (lonap.as13335.net probable bogus rDNS: No DNS) 141.101.70.122 RTT: 10ms TTL: 56 (No rDNS) 141.101.70.122 RTT: 10ms TTL: 55 (No rDNS) * * * failed 91.198.174.192 RTT: 19ms TTL: 52 (text-lb.esams.wikimedia.org ok)
Tracing route to dyna.wikimedia.org [198.35.26.96] over a maximum of 30 hops: 1 <1 ms <1 ms 1 ms 192.168.0.1 2 13 ms 10 ms 15 ms 147.160.40.2 3 14 ms 14 ms 11 ms 100.126.0.17 4 14 ms 10 ms 12 ms 100.126.0.1 5 11 ms 16 ms 13 ms ae3.core.pgrg.citywest.ca [24.244.66.205] 6 28 ms 29 ms 27 ms city-west-cable-and-telephone-corp.10gigabitethernet6-9.core1.yyc1.he.net [216.218.200.74] 7 30 ms 29 ms 27 ms 10ge6-9.core1.yyc1.he.net [216.218.200.73] 8 38 ms 39 ms 37 ms 100ge14-2.core1.yvr1.he.net [184.105.64.113] 9 40 ms 40 ms 42 ms 100ge10-2.core1.sea1.he.net [184.105.64.109] 10 44 ms 47 ms 43 ms 100ge15-1.core1.pdx1.he.net [184.105.64.138] 11 61 ms 60 ms 58 ms 100ge5-2.core1.pao1.he.net [184.104.193.25] 12 60 ms 77 ms 61 ms xe-0-1-1.cr4-ulsfo.wikimedia.org [198.32.176.214] 13 61 ms 60 ms 64 ms text-lb.ulsfo.wikimedia.org [198.35.26.96] Trace complete.
From Birmingham, England just now:
Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:
1 3 ms 3 ms 5 ms BrightBox.ee [192.168.1.1] 2 8 ms 8 ms 9 ms 172.16.12.155 3 * * 13 ms 213.121.98.129 4 13 ms 16 ms 16 ms 213.121.98.128 5 18 ms 14 ms 14 ms 87.237.20.138 6 38 ms 42 ms 44 ms linx-lon1.as13335.net [195.66.225.179] 7 15 ms 14 ms 14 ms 141.101.70.123 8 15 ms 14 ms 17 ms 141.101.70.123 9 * * * Request timed out.
10 22 ms 25 ms 21 ms text-lb.esams.wikimedia.org [91.198.174.192]
Trace complete.
Stopped mine manually. Looks like it is down here too. British Columbia if it matters.
traceroute en.wikipedia.org traceroute to dyna.wikimedia.org (198.35.26.96), 64 hops max, 52 byte packets 1 192.168.1.254 (192.168.1.254) 5.488 ms 5.123 ms 1.664 ms 2 10.31.128.1 (10.31.128.1) 1091.086 ms 910.696 ms 982.800 ms 3 154.11.10.159 (154.11.10.159) 1120.761 ms 686.399 ms 999.062 ms 4 64.125.15.12 (64.125.15.12) 1003.838 ms 63.769 ms 1001.126 ms 5 ae7.cs2.sea1.us.zip.zayo.com (64.125.26.4) 999.608 ms 133.960 ms 908.272 ms 6 ae3.cs2.sjc2.us.eth.zayo.com (64.125.29.40) 940.233 ms 980.503 ms 1124.357 ms 7 ae27.cr2.sjc2.us.zip.zayo.com (64.125.30.233) 151.749 ms 36.743 ms 139.132 ms 8 ae11.mpr4.sfo3.us.zip.zayo.com (64.125.24.226) 479.827 ms 938.800 ms 1046.789 ms 9 * * * 10 * * * 11 * * * 12 * * * 13 * * * 14 * * * 15 * * * 16 * * * 17 * * * 18 * * * 19 * * * 20 * * * 21 * * * 22 * * * 23 * * * 24 * * * 25 * * * 26 * * * 27 * * * 28 * * * 29 * * * 30 * * * 31 * * * 32 * * * 33 * * * 34 * * * 35 * * * 36 * * * 37 * * * 38 * * * 39 * * * 40 * * * 41 * * * 42 * * * 43 * * * 44 * * * 45 * * * 46 * * * 47 * * * 48 * * * 49 * * * 50 * * *
Norway: https://www.irccloud.com/pastebin/kGcHnxXE
US EAST COAST: https://imgur.com/oljPECR
From Rochester, UK right now.
Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:
1 3 ms 3 ms 3 ms [REDACTED] [[REDACTED]] 2 18 ms 18 ms 18 ms 172.16.10.162 3 * * * Request timed out. 4 21 ms 37 ms 20 ms 31.55.185.188 5 58 ms 21 ms 20 ms core2-hu0-15-0-6.c.[REDACTED].net [213.121.192.32] 6 30 ms 20 ms 20 ms core2-hu0-7-0-0.[REDACTED].net [194.72.16.128] 7 92 ms 84 ms 69 ms 109.159.253.95 8 21 ms 20 ms 21 ms 141.101.70.141 9 21 ms 20 ms 20 ms 141.101.70.141
10 * * * Request timed out.
11 28 ms 28 ms 29 ms text-lb.esams.wikimedia.org [91.198.174.192]
Trace complete.
This seems to have been made worse by T243725: HtmlPageLinkRendererBeginHookHandler creates a LanguageFallbackChain when it may not be needed. The patches for that Wikibase/Babel issue are already merged into master and should be deployed with the next train scheduled for 2020-02-05(?).
I think the priority of this patch could be lowered now, as the immediate flood of outages seems to be over.
I may be misunderstanding you, but do you mean there is some pre-existing issue relating to this outage separate from T243725? If so, what is that pre-existing issue? Both T243725 and this incident report were filed around the same time. My understanding is that T243725 is the pre-existing issue (I didn't see a recent change to its logic), and some unknown factor yesterday made it worse by exposing it much more commonly than before (possibly some kind of cache invalidation).
Again, or something different?
(Just now, from Birmingham, England)
Tracing route to dyna.wikimedia.org [91.198.174.192]
over a maximum of 30 hops:
1 8 ms 3 ms 6 ms BrightBox.ee [192.168.1.1] 2 13 ms 17 ms 10 ms 172.16.12.155 3 * * 15 ms 213.121.98.129 4 12 ms 16 ms 13 ms 213.121.98.128 5 30 ms 15 ms 23 ms 87.237.20.138 6 35 ms 16 ms 45 ms linx-lon1.as13335.net [195.66.225.179] 7 23 ms 17 ms 14 ms 141.101.70.127 8 24 ms 28 ms 15 ms 141.101.70.127 9 * * * Request timed out.
10 * * * Request timed out.
11 * * * Request timed out.
12 * * * Request timed out.
13 * * * Request timed out.
14 * * * Request timed out.
15 * * * Request timed out.
16 * * * Request timed out.
17 * * * Request timed out.
18 * * * Request timed out.
19 * * * Request timed out.
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * *
I believe you are both on the same page, I think that just meant that this overload outage was made worse by T243725 being a thing.
Change 567366 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Babel@master] Lower timeout for API reqs from Babel from 10s to 1s
Change 567366 abandoned by Krinkle:
Lower timeout for API reqs from Babel from 10s to 1s
Change 570635 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 3 seconds.
Change 570635 merged by jenkins-bot:
[mediawiki/extensions/Babel@master] Timeout for meta api call from 10 to 2 seconds.
Change 570663 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.
Change 570664 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.
Change 570664 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.16] Timeout for meta api call from 10 to 2 seconds.
Change 570663 merged by jenkins-bot:
[mediawiki/extensions/Babel@wmf/1.35.0-wmf.18] Timeout for meta api call from 10 to 2 seconds.
Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:28:40Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:29:57Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
I'm gonna go ahead and close this as a timeout decrease as been deployed and the patch to remove the API call all together will be deployed next week.
Also Babel itself probably didn't cause the outage, but was only one of the more visible results of the API stress.