
Beta cluster Error: 502, Next Hop Connection Failed
Closed, ResolvedPublic

Description

https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_Beta_Cluster_502

Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT

This started after inadvertent WMCS VM reboots.

Event Timeline

Hm

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)

(did sudo systemctl restart puppet-master.service)
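
The "Connection refused" warning above means nothing was listening on the Puppet master port at that moment. A minimal sketch of the same check, assuming only the host and port from the warning (the function name is hypothetical):

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and DNS failures alike.
        return False
```

Calling `port_is_open("deployment-puppetmaster04.deployment-prep.eqiad.wmflabs", 8140)` after the service restart would show whether the master is accepting connections again. It says nothing about whether catalog compilation works.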

Looking at https://beta-logs.wmcloud.org/goto/fe9f15115065505a886e3b9fd7f4413b there's a lot of

WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider::getLags: Request to Prometheus API http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated failed with * HTTP request timed out.
* There was a problem during the HTTP request: 0 Error

and

Unexpected connection error communicating with Elasticsearch. Curl code: 35

I've rebooted:

  • deployment-puppetmaster04 (per T315350#8158462, resolved)
  • deployment-prometheus02
  • deployment-elastic10

I'm out of ideas, at the edge of my knowledge and don't want to make things worse by guessing — help! 🙂

TheresNoTime raised the priority of this task from High to Unbreak Now! (Aug 16 2022, 7:37 PM)

Bumping to UBN!

RhinosF1 subscribed.

Hi Search Team, can you help with the WDQS errors?

Brian (Search platform SRE) here.

Apologies for the confusion: to the best of my knowledge, there is no WDQS in beta cluster.

UPDATED per @RhinosF1 observation:

I did take a look at the Elasticsearch beta cluster instances and found that the cluster status is healthy. However, deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud has a bad cert: its CN is deployment-elastic09.deployment-prep.eqiad.wmflabs, an outdated name that I will fix in a different task. I don't believe this is a factor in the current outage.
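
For illustration, the CN mismatch described above can be checked with a small sketch like the following. It assumes a certificate dict shaped like the one returned by Python's `ssl.SSLSocket.getpeercert()`; the function names are hypothetical, and real TLS validation normally relies on subjectAltName rather than CN alone.

```python
from typing import Optional


def common_name(cert: dict) -> Optional[str]:
    """Extract the subject commonName from a getpeercert()-style dict."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def cn_matches(cert: dict, expected_host: str) -> bool:
    """True if the certificate CN equals the expected host name."""
    cn = common_name(cert)
    return cn is not None and cn.lower() == expected_host.lower()
```

With the names from this comment, a cert whose CN is the old `.eqiad.wmflabs` name would not match the current `.eqiad1.wikimedia.cloud` host name.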

Please let us know if we can do anything else to help.

[2022-08-16T23:25:49Z, #wikimedia-releng, @Zabe]
<zabe> I try to understand why apache is not willing to start
<zabe> AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf
<zabe> Cannot define multiple Listeners on the same IP:port
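
The Apache error above is raised when the same address and port are declared by more than one Listen directive. A rough sketch for spotting that across config files, assuming the file contents are already loaded into a dict (the file names and port in the usage note are hypothetical, not taken from the host):

```python
import re
from collections import defaultdict

# Matches uncommented "Listen <target>" lines; only the first argument is used.
LISTEN_RE = re.compile(r"^\s*Listen\s+(\S+)", re.IGNORECASE | re.MULTILINE)


def duplicate_listeners(configs: dict) -> dict:
    """Map each Listen target declared more than once to the files declaring it.

    `configs` maps a file name to that file's text. Targets are compared
    literally, so "8141" and "*:8141" are NOT treated as the same listener.
    """
    seen = defaultdict(list)
    for name, text in configs.items():
        for match in LISTEN_RE.finditer(text):
            seen[match.group(1)].append(name)
    return {target: files for target, files in seen.items() if len(files) > 1}
```

For example, `duplicate_listeners({"ports.conf": "Listen 8141\n", "50-configmaster-port.conf": "Listen 8141\n"})` reports `8141` as declared in both files.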


The Puppet repo on deployment-puppetmaster04:/var/lib/git/operations/puppet is in MERGING state. There's an unresolved conflict in modules/profile/manifests/etcd/v3.pp. The conflict is between the upstream change I04aa7729e and a local patch, Iecfc26a94, which has been cherry-picked locally for the past year but never merged upstream.

git status and git diff here: P32410
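
A repository stuck in MERGING state like this one can be detected without running git, since git keeps a `MERGE_HEAD` file in the git directory while a merge is in progress. A minimal sketch, assuming a regular (non-bare) checkout with a `.git` directory; the function names are hypothetical:

```python
from pathlib import Path


def is_merging(repo: str) -> bool:
    """True if a merge is in progress, i.e. .git/MERGE_HEAD exists."""
    return (Path(repo) / ".git" / "MERGE_HEAD").is_file()


def has_conflict_markers(text: str) -> bool:
    """True if text contains the three conflict-marker lines in order."""
    stage = 0
    for line in text.splitlines():
        if stage == 0 and line.startswith("<<<<<<< "):
            stage = 1
        elif stage == 1 and line.startswith("======="):
            stage = 2
        elif stage == 2 and line.startswith(">>>>>>> "):
            return True
    return False
```

Here `is_merging("/var/lib/git/operations/puppet")` would be the check for the repo mentioned above, and `has_conflict_markers` could be run on the contents of modules/profile/manifests/etcd/v3.pp.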


[2022-08-17T00:00:57Z, #wikimedia-releng, @Zabe]
<zabe> some progress: https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ made puppet on puppetmaster04 run again

[2022-08-17T00:43:52Z, #wikimedia-releng, @ori]
<ori> OK I hacked in locally if $::realm != 'labs' {} around the ::esitest in profile/manifests/cache/varnish/frontend/text.pp
<ori> I'll turn it into a patch shortly
<ori> puppet ran successfully on deployment-cache-text06

Change 823766 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] beta cluster: don't instantiate ::esitest

https://gerrit.wikimedia.org/r/823766

[2022-08-17T00:49:52Z, #wikimedia-releng, @thcipriani]

<thcipriani> what I've managed to find: it seems like mediawiki is working locally on the MediaWiki hosts. But I get timeouts on the -cache-text server. Traffic server is complaining about varnish-frontend, but varnish isn't reporting errors afaict.
<thcipriani> welp. Restarting trafficserver seems to have fixed it.
<thcipriani> I note there are no errors in journalctl -u trafficserver and the systemd unit was "running"

https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page loads correctly
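
Since the trafficserver unit reported "running" with no errors while requests still failed, checking the service state alone was not enough. A hedged sketch of an end-to-end probe that looks at the actual HTTP response instead (function names are hypothetical; the 5xx threshold is an assumption about what counts as unhealthy):

```python
import urllib.error
import urllib.request

# An empty ProxyHandler ignores proxy environment variables,
# so the probe talks to the target directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, such as the 502 in this task, land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout and similar.
        return None


def is_healthy(url: str) -> bool:
    """Treat any answer below 500 as healthy; no answer or 5xx as unhealthy."""
    status = http_status(url)
    return status is not None and status < 500
```

Run against https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page, this would have kept reporting unhealthy until trafficserver was restarted, regardless of what systemd said.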

TheresNoTime lowered the priority of this task from Unbreak Now! to Medium (Aug 17 2022, 12:03 AM)
TheresNoTime updated the task description.

Follow-up items to get the Puppet repo on deployment-puppetmaster04 in good shape:

Change 824146 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud

https://gerrit.wikimedia.org/r/824146

Hi there. JFYI, after the changes early this morning in deployment-puppetmaster04 Puppet started failing on the beta deployment server:

Aug 17 00:23:40 deployment-deploy03 puppet-agent[13622]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
Aug 17 00:53:51 deployment-deploy03 puppet-agent[15700]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud

I've cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/823211/ into /var/lib/git/labs/private on deployment-puppetmaster04 and the agent seems happy again (fingers crossed)
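
The agent log lines above carry the failing function call, manifest location and node in a fixed shape. A small sketch that pulls those fields out, assuming log lines formatted like the two quoted here (the regex and function name are illustrative, not part of Puppet):

```python
import re

CATALOG_ERROR_RE = re.compile(
    r"Error while evaluating a Function Call, (?P<message>.+?) "
    r"\(file: (?P<file>[^,]+), line: (?P<line>\d+), column: (?P<column>\d+)\) "
    r"on node (?P<node>\S+)"
)


def parse_catalog_error(log_line: str):
    """Return message, file, line, column and node, or None if no match."""
    match = CATALOG_ERROR_RE.search(log_line)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    info["column"] = int(info["column"])
    return info
```

For the lines above this yields the message `secret(): invalid secret scap/phabricator_token`, pointing at line 107 of /etc/puppet/modules/scap/manifests/master.pp.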

Hi there. JFYI, after the changes early this morning in deployment-puppetmaster04 Puppet started failing on the beta deployment server:

Just to confirm puppet was failing on deployment-deploy03 not on deployment-puppetmaster04?

Aug 17 00:23:40 deployment-deploy03 puppet-agent[13622]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
Aug 17 00:53:51 deployment-deploy03 puppet-agent[15700]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud

this should have been fixed by https://gerrit.wikimedia.org/r/c/labs/private/+/823209

I've cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/823211/ into `/var/lib/git/labs/private` on `deployment-puppetmaster04`  and the agent seems happy again (fingers crossed)

This doesn't make much sense; perhaps this is the wrong link? That CR is a change to the operations/puppet repo, not the private repo. However, I don't think we should need to cherry-pick the private repo change, as it has already been merged.

Just to confirm puppet was failing on deployment-deploy03 not on deployment-puppetmaster04?

Correct, deployment-deploy03 is the beta deployment server. The failures started right after the Puppet changes in deployment-puppetmaster04

this should have been fixed by https://gerrit.wikimedia.org/r/c/labs/private/+/823209

Yes, sorry, I pasted the wrong URL, the relevant patch was 823209 in labs/private. As far as I can tell, /var/lib/git/labs/private is being maintained manually judging by the commits there, there's even quite a few marked with [LOCAL]. Nevertheless, I don't really know how we handle it normally.

Correct, deployment-deploy03 is the beta deployment server. The failures started right after the Puppet changes in deployment-puppetmaster04

Ack, I suspect that could be because Ori fixed the merge conflicts, which brought in a bunch more changes.

Yes, sorry, I pasted the wrong URL, the relevant patch was 823209 in labs/private. As far as I can tell, /var/lib/git/labs/private is being maintained manually judging by the commits there, there's even quite a few marked with [LOCAL]. Nevertheless, I don't really know how we handle it normally.

My understanding is that only real secrets should be overlaid on top of the private repo, and fake/mock secrets should be pulled in from labs/private. I'm not super familiar with deployment-prep myself, though, so I will leave it for others to confirm/clarify.
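
The layering described above can be pictured as a two-level lookup in which local real secrets shadow the fake ones from labs/private. This is only a toy model of that idea, not how Puppet's secret() function is implemented; all names and values below are hypothetical placeholders.

```python
from collections import ChainMap

# Hypothetical placeholder data, not actual repository contents.
fake_secrets = {"scap/phabricator_token": "fake-token", "example/other": "dummy"}
real_overlay = {"example/other": "real-value-kept-locally"}


def lookup_secret(name: str, real: dict, fake: dict) -> str:
    """Return the real secret if one is overlaid, else the fake one."""
    layers = ChainMap(real, fake)  # earlier mappings win on lookup
    if name not in layers:
        raise KeyError(f"invalid secret {name}")
    return layers[name]
```

In this model, the "invalid secret scap/phabricator_token" failure corresponds to the name being absent from both layers, which is why pulling the merged labs/private change in was enough to fix it.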

Ok, lemme try to quickly summarize what happened and what was done.

Some cloudvirt hosts were accidentally rebooted, which took deployment-prep offline, and it did not come back up by itself.

Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT

Simply restarting apache on the app servers and traffic server on the cache servers did not seem to fix the problem.

TNT noticed errors showing up when running logspam-watch on deployment-mwlog01, see T315379. The fix for those was https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453, which was missing on that host, so apparently puppet was no longer running.

samtar@deployment-puppetmaster04:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Loading facts
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]

I fixed this failure by regenerating the certificates. I followed https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster for that.

Puppet was also not updating for a second reason: a merge conflict between an upstream patch and a local cherry-pick, see P32410. Ori fixed that merge conflict; see T315395 for follow-up.

At this point puppet was still not running, now failing with the following error.

zabe@deployment-puppetmaster04:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Loading facts
Error: Could not retrieve catalog from remote server: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)

After some digging, it turned out that apache was not running and it refused to start.

<zabe> AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf
<zabe> Cannot define multiple Listeners on the same IP:port

https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ fixed this. The problem seems to have been introduced somewhere in https://gerrit.wikimedia.org/r/c/operations/puppet/+/797222, https://gerrit.wikimedia.org/r/c/operations/puppet/+/798615 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/798631.

At this stage puppet was finally running on puppetmaster, but it wasn't on deployment-cache-text06.

samtar@deployment-cache-text06:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Trafficserver] is already declared at (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251); cannot redeclare (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251) (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251, column: 5) (file: /etc/puppet/modules/profile/manifests/trafficserver/tls.pp, line: 168) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Ori fixed this by reverting some puppet patches which seem to be incompatible with beta at this stage, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639. The follow-up for this is T315394. They also disabled ::esitest on deployment-prep, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/823766/.

At this point puppet was finally running everywhere, but beta cluster was still unreachable. A simple restart of trafficserver on deployment-cache-text06 fixed that.
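
The agent runs quoted in this summary failed in three distinct ways: a revoked certificate, an unreachable master, and a catalog compilation error. A minimal sketch that tells them apart from the agent output, assuming messages worded like the ones above (the labels are made up for this example):

```python
import re

# Checked in order; the first matching pattern wins.
FAILURE_PATTERNS = [
    ("certificate_revoked",
     re.compile(r"certificate verify failed \(certificate revoked\)")),
    ("master_unreachable",
     re.compile(r"Failed to open TCP connection to \S+:\d+ \(Connection refused")),
    ("catalog_compile_error",
     re.compile(r"Error 500 on SERVER")),
]


def classify_agent_output(output: str) -> str:
    """Return a label for the first known failure found, else 'unknown'."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(output):
            return label
    return "unknown"
```

Each label maps to a different fix in this incident: regenerating certificates, getting apache to start on the puppetmaster, and reverting or guarding the incompatible puppet patches.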

Change 823766 abandoned by Ori:

[operations/puppet@production] beta cluster: don't instantiate ::esitest

Reason:

I think this is obsoleted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/824146 , and indeed it's no longer cherry-picked on beta.

https://gerrit.wikimedia.org/r/823766

Krinkle renamed this task from Known, Beta cluster Error: 502, Next Hop Connection Failed to Beta cluster Error: 502, Next Hop Connection Failed (Aug 18 2022, 2:54 PM)
Gehel subscribed.

This does not seem to be related to Search / WDQS, so I'll untag the Search Platform team. Ping us again if you need us!

jbond claimed this task.

@Zabe thanks for the detailed summary. It looks like this issue is now resolved, so I'll close this task, but please reopen it if there is still something outstanding.