HTTP 504 connection timeout error accessing MW API on Beta cluster
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
beta enwiki:

Error

Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes.

See the error message at the bottom of this page for more information.

If you report this error to the Wikimedia System Administrators, please include the details below.

Request from REDACTED via deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud, ATS/9.1.4
Error: 504, Connection Timed Out at 2023-11-24 12:53:26 GMT

betacommons:

504 Gateway Time-out
The server didn't respond in time.

What should have happened instead?:
HTTP 200

Event Timeline

AlexisJazz renamed this task from Betacommons API timeout to Beta cluster API timeout. (Nov 24 2023, 12:53 PM)
AlexisJazz updated the task description.
AlexisJazz added a project: Traffic.

As I don't know the cause I'm just tagging as both traffic and train blocker.

Aklapper renamed this task from Beta cluster API timeout to HTTP 504 connection timeout error accessing MW API on Beta cluster. (Nov 24 2023, 1:29 PM)
Aklapper removed a project: MediaWiki-Action-API.

(Removed 1.42.0-wmf.13 as that's six weeks away, and MW-API tag as I'd not expect it to be related in a 504).
Interestingly, wget gives me an HTTP/1.1 503 Backend fetch failed.

TheresNoTime subscribed.

Beta being beta, let's close this as a transient.

I am getting test failures on restbase because of enwiki beta Action API returning 504. I think I first encountered it yesterday.

Seeing this when attempting to post to a Flow board on enwiki beta.

I also see 378 events in the last hour for WikiLambda: "The maximum execution time of 60 seconds was exceeded" for /w/api.php?action=wikilambda_health_check&format=json.

Just guessing, but perhaps the PoolCounter service for beta is offline and that's causing other issues

image.png (934×1 px, 201 KB)

Looking at

image.png (1×1 px, 87 KB)

Seems like we want to find what happened on November 21.

api.php is currently handled by deployment-mediawiki11, and that instance is unreachable at the moment:

vgutierrez@deployment-cache-text08:~$ sudo cat /etc/trafficserver/remap.config |fgrep api.php
regex_map http://(.*)/w/api.php http://deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud/w/api.php @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/rb-mw-mangling.lua
vgutierrez@deployment-cache-text08:~$ nc -w 3 -zv deployment-mediawiki11 80
nc: connect to deployment-mediawiki11 (172.16.3.203) port 80 (tcp) timed out: Operation now in progress
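
The two checks above (look up which backend remap.config sends api.php to, then probe its port 80) could be scripted roughly as follows. This is only a sketch: `origin_for` and `port_open` are hypothetical helper names, and the parsing assumes the simple `regex_map <from> <to> ...` layout seen in the output above rather than the full remap.config grammar.

```python
import socket
from urllib.parse import urlsplit


def origin_for(remap_text: str, needle: str) -> list:
    """Return origin hostnames of regex_map rules whose pattern mentions `needle`."""
    hosts = []
    for line in remap_text.splitlines():
        fields = line.split()
        # Assumed layout: regex_map <from-pattern> <to-url> [@plugin=... ...]
        if len(fields) >= 3 and fields[0] == "regex_map" and needle in fields[1]:
            hosts.append(urlsplit(fields[2]).hostname)
    return hosts


def port_open(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Rough equivalent of `nc -w 3 -z host port`: True if a TCP connect succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

Feeding the contents of remap.config to `origin_for(text, "api.php")` and passing each returned host to `port_open` would reproduce the manual check; a `False` result corresponds to the nc timeout shown above.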
kostajh triaged this task as Unbreak Now! priority. (Nov 30 2023, 2:58 PM)

Raising the priority as the beta cluster is unusable without API connections.

The timing of the spike in errors aligns with rMW4344b2fb8072: PoolCounterConnectionManager: Add support for ipv6 being merged, cc @Paladox, @Krinkle, @tstarling

@Michael pointed out that this was just reverted (T352444), and Beta seems to have gotten better since then. How is it working for others now?

API requests seem to work now.

I just realized it's working again. (took me a moment to realize the "no response" error from my script vanished)

kostajh claimed this task.

Re-opening, since we still see CI for restbase failing on GitHub, due to timeout errors from beta. E.g.:

Vgutierrez lowered the priority of this task from Unbreak Now! to High. (Dec 4 2023, 11:55 AM)

@daniel what's timing out is parsoid-external-ci-access.beta.wmflabs.org:

$ curl --http1.1 https://sr.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/RESTBase_Testing_Page -v -o /dev/null -s 2>&1 |grep HTTP/1.1
* using HTTP/1.1
> GET /api/rest_v1/page/html/RESTBase_Testing_Page HTTP/1.1
< HTTP/1.1 200 OK

But

$ nc -w 3 -zv parsoid-external-ci-access.beta.wmflabs.org 80
nc: connect to parsoid-external-ci-access.beta.wmflabs.org (185.15.56.9) port 80 (tcp) timed out: Operation now in progress

A quick check on deployment-parsoid12 shows that the ferm rules on that instance are at fault (the host firewall is dropping traffic to port 80):

Dec  4 15:38:53 deployment-parsoid12 ulogd[15249]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:db:ed:c8:fa:16:3e:ae:f5:88:08:00 SRC=<REDACTED> DST=172.16.4.125 LEN=60 TOS=00 PREC=0x00 TTL=49 ID=36184 DF PROTO=TCP SPT=54462 DPT=80 SEQ=3181076532 ACK=0 WINDOW=64240 SYN URGP=0 MARK=0
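
For anyone sifting through many such entries, a drop line like the one above can be split into fields with a few lines of Python. This is a minimal sketch, assuming the `[prefix] KEY=value ... FLAG ...` layout seen here; `parse_fw_log` is a hypothetical helper name, not part of ulogd or ferm.

```python
def parse_fw_log(line: str) -> dict:
    """Split a firewall log line into KEY=value fields plus bare flags."""
    fields, flags = {}, []
    # Everything after the "[prefix] " tag is a space-separated token list.
    _, _, payload = line.partition("] ")
    for token in payload.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value   # e.g. DPT=80, or OUT= with an empty value
        else:
            flags.append(token)   # bare tokens such as DF or SYN
    fields["flags"] = flags
    return fields
```

On the line above, the interesting fields are `PROTO`, `DPT` and the `SYN` flag: a new TCP connection attempt to port 80 that the `fw-in-drop` rule discarded.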

The instance currently allows traffic to port 80 from the following IP ranges:

iptables -L -n |grep 80
ACCEPT     tcp  --  172.16.0.0/21        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.16.128.0/24      0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.1.0/24        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.2.0/24        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.255.0/24      0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.3.0/24        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.4.0/24        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  172.20.5.0/24        0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  185.15.56.0/25       0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  185.15.56.160/28     0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  185.15.57.0/29       0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  185.15.57.16/29      0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  185.15.57.24/29      0.0.0.0/0            tcp dpt:80
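
To check whether a given client address would be accepted by these rules, the ranges can be tested with Python's standard `ipaddress` module. This is a sketch only: `matching_rule` is a hypothetical helper, the list is copied by hand from the output above, and since the real source address in the log is redacted, any address tried here is a stand-in.

```python
import ipaddress

# Source ranges copied from the iptables output above (port 80 ACCEPT rules).
ALLOWED_PORT_80 = [ipaddress.ip_network(n) for n in (
    "172.16.0.0/21", "172.16.128.0/24", "172.20.1.0/24", "172.20.2.0/24",
    "172.20.255.0/24", "172.20.3.0/24", "172.20.4.0/24", "172.20.5.0/24",
    "185.15.56.0/25", "185.15.56.160/28", "185.15.57.0/29",
    "185.15.57.16/29", "185.15.57.24/29",
)]


def matching_rule(source_ip: str):
    """Return the first allowed network containing source_ip, or None."""
    addr = ipaddress.ip_address(source_ip)
    for net in ALLOWED_PORT_80:
        if addr in net:
            return net
    return None
```

An address inside the Cloud VPS ranges matches one of the networks, while an arbitrary external address (for example one from the 203.0.113.0/24 documentation range) returns `None`, which is consistent with external CI traffic being dropped.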

Ah right, that's T350353: Parsoid instance on beta not accesible from restbase CI/dev envs.

I failed to realize that PRs aren't automatically rebased on master before CI runs. Let's see how it goes after I rebase.

CI works after rebase. Sorry for the noise.