Page MenuHomePhabricator

Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022)
Closed, ResolvedPublicBUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

  • Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8 Error: 502, Next Hop Connection Failed at 2022-02-28 03:52:49 GMT
  • Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8 Error: 502, Next Hop Connection Failed at 2022-02-28 03:50:39 GMT

What should have happened instead?:
200

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:
https://downforeveryoneorjustme.com/commons.wikimedia.beta.wmflabs.org:

It's not just you! commons.wikimedia.beta.wmflabs.org is down.

Traceroute for commons.wikimedia.beta.wmflabs.org past AMS-IX:

9  ae1-403.cr2-esams.wikimedia.org (91.198.174.254)  19.741 ms  19.879 ms  19.748 ms
10  xe-4-1-3.cr2-eqiad.wikimedia.org (91.198.174.248)  105.683 ms  108.986 ms  108.827 ms
11  * * *
12  cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.246)  98.976 ms  99.085 ms  98.981 ms
13  m.wikidata.beta.wmflabs.org (185.15.56.36)  99.937 ms  103.200 ms  102.996 ms
14  m.wikidata.beta.wmflabs.org (185.15.56.36)  99.628 ms  99.571 ms  99.589 ms

Beats me why it says m.wikidata.beta.wmflabs.org, I ran it twice.

Event Timeline

Hmm, it works again.

Edit: 15 minutes later and it's down again.

I am also experiencing unreliability. Particularly when trying to save edits.

In logstash I am seeing lots of different errors, for example:

AH00288: scoreboard is full, not at MaxRequestWorkers[mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers

(https://beta-logs.wmcloud.org/app/discover#/doc/5f0c9be0-0b6f-11ec-9cde-3f4490e09a26/logstash-2022.02.28?id=wh24P38B89ekSwro2Jdr)

and

Failed to fetch API response from enwiki. Error code {code}

(https://beta-logs.wmcloud.org/app/dashboards#/doc/5f0c9be0-0b6f-11ec-9cde-3f4490e09a26/logstash-mediawiki-2022.02.28?id=hB24P38B89ekSwrotJf_)

Don't know if they are worth raising separately or if they are all symptoms of the same problem.

root@deployment-cache-text06:/var/log/trafficserver# for i in {1..5}; do nc -zv -w 5 deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80; done                        
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress

deployment-cache-text06 is having issues reaching deployment-mediawiki11, @MMandere is currently investigating the issue

apache2 is currently screaming on deploiyment-mediawiki11:

Feb 28 10:54:28 deployment-mediawiki11 apache2[6975]: [mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers
Feb 28 10:54:29 deployment-mediawiki11 apache2[6975]: [mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers
Feb 28 10:54:30 deployment-mediawiki11 apache2[6975]: [mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers
Feb 28 10:54:31 deployment-mediawiki11 apache2[6975]: [mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers

I've restarted apache2 right now, but it looks like the server could be handling more requests than usual:

Feb 28 10:56:08 deployment-mediawiki11 apache2[386]: [mpm_worker:error] [pid 386:tid 140184610657408] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
JMeybohm triaged this task as Medium priority.Mar 2 2022, 10:55 AM
JMeybohm subscribed.

deployment-mediawiki11 has been replaced by deployment-mediawiki12 (although there still is a reference in scap::dsh::groups of hieradata/cloud/eqiad1/deployment-prep/common.yaml

hmm if that's the case horizon data for deployment-prep-cache needs to be updated as well cause right now deployment-mediawiki11 is being referenced there:

replacement: http://deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud/w/api.php
target: http://(.*)/w/api.php

I was really just relaying from T300525 but it looks like something is off. deployment-mediawiki11 was replaced in T300591; https://gerrit.wikimedia.org/r/c/operations/puppet/+/758584, then readded to the the dsh group in https://gerrit.wikimedia.org/r/c/operations/puppet/+/759559

But the source of the replacement line I don't see. There is another reference in modules/pontoon/files/traffic/hiera/cache.yaml though.

I was really just relaying from T300525 but it looks like something is off. deployment-mediawiki11 was replaced in T300591; https://gerrit.wikimedia.org/r/c/operations/puppet/+/758584, then readded to the the dsh group in https://gerrit.wikimedia.org/r/c/operations/puppet/+/759559

This is an excellent example of why T215217: deployment-prep (beta cluster): Code stewardship request matters :-)

Some backstory:

  • mw11 was created by me when upgrading these hosts from Stretch to Buster at some point last year. It was serving all of the traffic until the changes in T300525#7664161.
  • Joe created mw12 to test PHP 7.4 changes last November (T295578).
  • Something happens which causes deployment-prep mediawiki VMs to start struggling and fail to serve requests. Some people not familiar with the setup make hotfixes to keep the service running. This includes Ema making a change to point traffic MW traffic from mw11 to mw12, believing that mw11 was decomissioned and someone just forgot to change that. Confusion happens because mw12 was not intended to serve traffic and wasn't even receiving code updates.
  • I was asked to look at the deployment-prep mediawiki servers since they were constantly being down. I didn't have motivation to look into the root causes, so I just added a remap rule into the deployment-prep ATS mappings to send api requests to mw11 (which was at the time not receiving any traffic) to hopefully lower the load of both appservers.

But the source of the replacement line I don't see. There is another reference in modules/pontoon/files/traffic/hiera/cache.yaml though.

It's configured via the Horizon hiera settings. Settings set there are visible in code search, for example: https://codesearch.wmcloud.org/search/?q=deployment-mediawiki&i=nope&files=&excludeFiles=&repos=

Change 767492 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/puppet@production] deployment-prep: change the mwapi host back to mediawiki11

https://gerrit.wikimedia.org/r/767492

Change 767492 merged by Dzahn:

[operations/puppet@production] deployment-prep: change the mwapi host back to mediawiki11

https://gerrit.wikimedia.org/r/767492

AH00288: scoreboard is full, not at MaxRequestWorkers[mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers
Failed to fetch API response from enwiki. Error code {code}

I am still seeing a lot of the above errors in the beta logs, and beta is being unreliable again.

I ran the command from T302699#7740902, here is the output:

dwalden@deployment-cache-text06:~$ for i in {1..5}; do nc -zv -w 5 deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80; done
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress

Mentioned in SAL (#wikimedia-releng) [2022-03-07T10:53:11Z] <zabe> restarted apache on deployment-mediawiki11 # T302699

AH00288: scoreboard is full, not at MaxRequestWorkers[mpm_worker:error] [pid 6975:tid 139968673539200] AH00288: scoreboard is full, not at MaxRequestWorkers
Failed to fetch API response from enwiki. Error code {code}

I am still seeing a lot of the above errors in the beta logs, and beta is being unreliable again.

I ran the command from T302699#7740902, here is the output:

dwalden@deployment-cache-text06:~$ for i in {1..5}; do nc -zv -w 5 deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80; done
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
Connection to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80 port [tcp/http] succeeded!
nc: connect to deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud port 80 (tcp) timed out: Operation now in progress

I had just filed T303165. This seems different, previously either beta cluster would be unreachable or reachable. Now all pages load just fine, but the API is consistently unreachable. Also, the API fails with a 503 while this task is about 502.

@Majavah are you sure T303165 is a dupe? That task is about api.php (and nothing else!) consistently producing a 503 while this task is about all pages intermittently producing a 502.

@Majavah are you sure T303165 is a dupe? That task is about api.php (and nothing else!) consistently producing a 503 while this task is about all pages intermittently producing a 502.

It's the same underlying cause (apache not being able to serve requests Varnish gives it). On deployment-prep, api.php and everything else are served by two separate servers as an attempt to reduce load.

In T302699#7756404, @Majavah wrote:

@Majavah are you sure T303165 is a dupe? That task is about api.php (and nothing else!) consistently producing a 503 while this task is about all pages intermittently producing a 502.

It's the same underlying cause (apache not being able to serve requests Varnish gives it). On deployment-prep, api.php and everything else are served by two separate servers as an attempt to reduce load.

Okay. Makes me curious why the different error codes though? If the api.php server fails it's 503 and if the other one fails it's 502 or something like that? Except intake-analytics T303160 produces 502.

in this case a 502 is emitted by ats-backend cause it isn't able to reach its backend server. The 503 is emitted by the app layer itself cause it's experiencing a temporary issue.

The cache layer usually emits 502s (Bad Gateway) or 504s (Gateway Timeout)

in this case a 502 is emitted by ats-backend cause it isn't able to reach its backend server. The 503 is emitted by the app layer itself cause it's experiencing a temporary issue.

The cache layer usually emits 502s (Bad Gateway) or 504s (Gateway Timeout)

Interesting. But doesn't this mean I was probably right? The server that does api.php broke, the app layer. That's not the fault of ATS, is it? It started working again btw, somewhere in the last few hours probably.

Mentioned in SAL (#wikimedia-releng) [2022-03-10T09:59:47Z] <dwalden> restarted apache on deployment-mediawiki11 # T302699

This is happening again. I am also seeing:

Request from 52.225.87.246 via deployment-cache-text06 deployment-cache-text06, Varnish XID 132323714
Error: 503, Backend fetch failed at Tue, 19 Apr 2022 17:05:29 GMT

I reran dwalden@deployment-cache-text06:~$ for i in {1..5}; do nc -zv -w 5 deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud 80; done and it returned 5 "succeeded", so no problems there.

zabe@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log
Apr 19 17:13:55 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:13:56 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
Apr 19 17:13:57 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:01 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:02 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:04 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:05 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:07 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:11 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
Apr 19 17:14:15 deployment-mediawiki12 apache2[18335]: [mpm_worker:error] [pid 18335:tid 140671451640960] AH00288: scoreboard is full, not at MaxRequestWorkers
zabe@deployment-mediawiki12:~$

https://en.wikipedia.beta.wmflabs.org/wiki/Bert

Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2022-04-22 02:23:20 GMT

Happening again:

dwalden@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log
Apr 27 13:58:00 deployment-mediawiki12 apache2[28065]: [proxy_fcgi:error] [pid 28065:tid 140670941505280] (70007)The timeout specified has expired: [client 172.16.1.181:38180] AH01075: Error dispatching request to : (polling), referer: https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
Apr 27 13:58:00 deployment-mediawiki12 apache2[28007]: [proxy_fcgi:error] [pid 28007:tid 140670421419776] (70007)The timeout specified has expired: [client 172.16.1.181:38216] AH01075: Error dispatching request to : (polling), referer: https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
Apr 27 13:58:00 deployment-mediawiki12 apache2[28007]: [proxy_fcgi:error] [pid 28007:tid 140670966683392] (70007)The timeout specified has expired: [client 172.16.1.181:38218] AH01075: Error dispatching request to : (polling), referer: https://de.m.wikipedia.beta.wmflabs.org/
Apr 27 13:58:00 deployment-mediawiki12 apache2[28065]: [proxy_fcgi:error] [pid 28065:tid 140670949897984] (70007)The timeout specified has expired: [client 172.16.1.181:38246] AH01075: Error dispatching request to : (polling)
Apr 27 13:58:00 deployment-mediawiki12 apache2[28065]: [proxy_fcgi:error] [pid 28065:tid 140670933112576] (70007)The timeout specified has expired: [client 172.16.1.181:38270] AH01075: Error dispatching request to : (polling)
Apr 27 13:58:00 deployment-mediawiki12 apache2[28007]: [proxy_fcgi:error] [pid 28007:tid 140670798894848] (70007)The timeout specified has expired: [client 172.16.1.181:38288] AH01075: Error dispatching request to : (polling)
Apr 27 13:58:02 deployment-mediawiki12 apache2[28007]: [proxy_fcgi:error] [pid 28007:tid 140670941505280] (70007)The timeout specified has expired: [client 172.16.1.181:38294] AH01075: Error dispatching request to : (polling)
Apr 27 13:58:03 deployment-mediawiki12 apache2[28065]: [proxy_fcgi:error] [pid 28065:tid 140671067330304] (70007)The timeout specified has expired: [client 172.16.1.181:38314] AH01075: Error dispatching request to : (polling), referer: https://commons.wikimedia.beta.wmflabs.org/
Apr 27 13:58:04 deployment-mediawiki12 apache2[28065]: [proxy_fcgi:error] [pid 28065:tid 140670715033344] (70007)The timeout specified has expired: [client 172.16.1.181:38334] AH01075: Error dispatching request to : (polling), referer: https://meta.wikimedia.beta.wmflabs.org/
Apr 27 13:58:11 deployment-mediawiki12 apache2[28007]: [proxy_fcgi:error] [pid 28007:tid 140670429812480] (70007)The timeout specified has expired: [client 172.16.1.181:38338] AH01075: Error dispatching request to : (polling)

Happening again. Nothing in the apache2.log on mediawiki12 since 11:55 (UTC?)

Happening again. Nothing in the apache2.log on mediawiki12 since 11:55 (UTC?)

And it's back again.

Krinkle closed this task as Resolved.EditedJun 2 2022, 1:49 AM
Krinkle subscribed.

Appears resolved. Unless there is a specific common cause recurring, I assume there's no need to keep the Feb 2022 ticket open.