Beta cluster Error: 502, Next Hop Connection Failed
Closed, Resolved, Public

Assigned To

Authored By

	TheresNoTime
	Aug 16 2022, 5:21 PM

Description

https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_Beta_Cluster_502

Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT

post inadvertent WMCS VM reboots

Details

	Subject	Repo	Branch	Lines +/-
	beta cluster: don't instantiate ::esitest	operations/puppet	production	+4 -2


Related Objects

Status	Subtype	Assigned	Task
Resolved		jbond	T315350 Beta cluster Error: 502, Next Hop Connection Failed
Invalid	PRODUCTION ERROR	None	T315354 (Beta Cluster) Unexpected connection error communicating with Elasticsearch. Curl code: {curl_code}
Invalid	PRODUCTION ERROR	None	T315355 (Beta cluster) Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded
Resolved		None	T315351 Evaluation Error on deployment-cache-text06 puppet run
Resolved		Zabe	T315379 (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors
Resolved		jbond	T315394 Remove two cherry-picked reverts from deployment-puppetmaster04
Declined		jbond	T315395 Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04

Event Timeline

TheresNoTime created this task. Aug 16 2022, 5:21 PM

Restricted Application added a subscriber: Aklapper. Aug 16 2022, 5:21 PM

Daimona subscribed. Aug 16 2022, 5:21 PM

ppelberg subscribed. Aug 16 2022, 5:22 PM

matmarex added a parent task: T312253: 502 errors on beta cluster. Aug 16 2022, 5:24 PM

matmarex subscribed.

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)

(did sudo systemctl restart puppet-master.service)

TheresNoTime triaged this task as High priority. Aug 16 2022, 5:46 PM

Ryasmeen subscribed. Aug 16 2022, 5:56 PM

TheresNoTime added a subtask: T315354: (Beta Cluster) Unexpected connection error communicating with Elasticsearch. Curl code: {curl_code}. Aug 16 2022, 6:13 PM

TheresNoTime added a subtask: T315355: (Beta cluster) Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded.

Looking at https://beta-logs.wmcloud.org/goto/fe9f15115065505a886e3b9fd7f4413b there's a lot of

WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider::getLags: Request to Prometheus API http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated failed with * HTTP request timed out.
* There was a problem during the HTTP request: 0 Error

and

Unexpected connection error communicating with Elasticsearch. Curl code: 35

I've rebooted:

deployment-puppetmaster04 (per T315350#8158462, resolved)
deployment-prometheus02
deployment-elastic10

I'm out of ideas, at the edge of my knowledge and don't want to make things worse by guessing — help! 🙂

Bumping to UBN!

Hi Search Team, can you help with the WDQS errors?

bking subscribed. Aug 16 2022, 8:02 PM

Brian (Search platform SRE) here.

Apologies for the confusion: to the best of my knowledge, there is no WDQS in beta cluster.

UPDATED per @RhinosF1 observation:

I took a look at the Elasticsearch beta cluster instances and found that the cluster status is healthy. However, deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud has a bad cert: its CN is deployment-elastic09.deployment-prep.eqiad.wmflabs, an outdated name that I will fix in a different task. I don't believe this is a factor in the current outage.

Please let us know if we can do anything else to help.

Maintenance_bot added a project: Wikidata. Aug 16 2022, 8:29 PM

Restricted Application added a project: [DEPRECATED] wdwb-tech. Aug 16 2022, 8:29 PM

TheresNoTime added a project: Release-Engineering-Team. Aug 16 2022, 8:42 PM

Mentioned in SAL (#wikimedia-releng) [2022-08-16T20:51:48Z] <RhinosF1> beta: is down see wikitech-l and https://phabricator.wikimedia.org/T315350

ArielGlenn subscribed. Aug 16 2022, 8:55 PM

RhinosF1 added a subtask: T315351: Evaluation Error on deployment-cache-text06 puppet run. Aug 16 2022, 8:55 PM

Jdforrester-WMF removed a parent task: T312253: 502 errors on beta cluster. Aug 16 2022, 9:26 PM

bking mentioned this in T315386: Replace certificate on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud. Aug 16 2022, 10:28 PM

[2022-08-16T23:25:49Z, #wikimedia-releng, @Zabe]
<zabe> I try to understand why apache is not willing to start
<zabe> AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf
<zabe> Cannot define multiple Listeners on the same IP:port

In T315379#8159822, @ori wrote:

The Puppet repo on deployment-puppetmaster04:/var/lib/git/operations/puppet is in MERGING state. There's an unresolved conflict in modules/profile/manifests/etcd/v3.pp. The conflict is between the upstream change I04aa7729e and a local patch, Iecfc26a94, which has been cherry-picked locally for the past year but never merged upstream.

git status and git diff here: P32410
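A rough way to spot this condition from the command line is sketched below. It is not the procedure used in this task (that output is in P32410); the function name `merge_state` is hypothetical, and the repository path is whatever the caller passes in.

```shell
#!/bin/sh
# Report whether a Git working copy is stuck mid-merge and, if so,
# list the files that still have unresolved conflicts.
# Usage: merge_state /var/lib/git/operations/puppet
merge_state() {
    repo="$1"
    # MERGE_HEAD only resolves while a merge is in progress.
    if git -C "$repo" rev-parse -q --verify MERGE_HEAD >/dev/null 2>&1; then
        echo "MERGING"
        # --diff-filter=U restricts the listing to unmerged (conflicted) paths.
        git -C "$repo" diff --name-only --diff-filter=U
    else
        echo "clean"
    fi
}
```

Note that this only detects an interrupted merge; an interrupted cherry-pick or rebase leaves different markers behind and would report "clean" here.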

[2022-08-17T00:00:57Z, #wikimedia-releng, @Zabe]
<zabe> some progress: https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ made puppet on puppetmaster04 run again

TheresNoTime added a subtask: T315379: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors. Aug 16 2022, 11:04 PM

TheresNoTime closed subtask T315354: (Beta Cluster) Unexpected connection error communicating with Elasticsearch. Curl code: {curl_code} as Invalid.

TheresNoTime mentioned this in T315354: (Beta Cluster) Unexpected connection error communicating with Elasticsearch. Curl code: {curl_code}.

TheresNoTime closed subtask T315355: (Beta cluster) Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded as Invalid.

TheresNoTime mentioned this in T315355: (Beta cluster) Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded.

[2022-08-17T00:43:52Z, #wikimedia-releng, @ori]
<ori> OK I hacked in locally if $::realm != 'labs' {} around the ::esitest in profile/manifests/cache/varnish/frontend/text.pp
<ori> I'll turn it into a patch shortly
<ori> puppet ran successfully on deployment-cache-text06

Change 823766 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] beta cluster: don't instantiate ::esitest

https://gerrit.wikimedia.org/r/823766

gerritbot added a project: Patch-For-Review. Aug 16 2022, 11:57 PM

[2022-08-17T00:49:52Z, #wikimedia-releng, @thcipriani]

<thcipriani> what I've managed to find: it seems like mediawiki is working locally on the MediaWiki hosts. But I get timeouts on the -cache-text server. Traffic server is complaining about varnish-frontend, but varnish isn't reporting errors afaict.
<thcipriani> welp. Restarting trafficserver seems to have fixed it.
<thcipriani> I note there are no errors in journalctl -u trafficserver and the systemd unit was "running"

https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page loads correctly

TheresNoTime lowered the priority of this task from Unbreak Now! to Medium. Aug 17 2022, 12:03 AM

TheresNoTime updated the task description.

Follow-up items to get the Puppet repo on deployment-puppetmaster04 in good shape:

The two cherry-picked reverts should be removed (https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638, https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639) and the changes they revert should be updated to be compatible with the Beta Cluster.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701 needs to be rebased and merged or re-cherry-picked on deployment-puppetmaster04. To get it to apply I had to manually resolve a conflict, and I'm not sure I did it correctly. So the actual diff on deployment-puppetmaster04 is not consistent with what's on Gerrit.

TheresNoTime mentioned this in T315394: Remove two cherry-picked reverts from deployment-puppetmaster04. Aug 17 2022, 12:12 AM

TheresNoTime added a subtask: T315394: Remove two cherry-picked reverts from deployment-puppetmaster04.

TheresNoTime mentioned this in T315395: Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04.

TheresNoTime added a subtask: T315395: Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04.

Zabe closed subtask T315379: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors as Resolved. Aug 17 2022, 12:19 AM

jbond subscribed. Aug 17 2022, 8:36 AM

Change 824146 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud

https://gerrit.wikimedia.org/r/824146

Hi there. JFYI, after the changes early this morning in deployment-puppetmaster04 Puppet started failing on the beta deployment server:

Aug 17 00:23:40 deployment-deploy03 puppet-agent[13622]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
Aug 17 00:53:51 deployment-deploy03 puppet-agent[15700]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud

I've cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/823211/ into /var/lib/git/labs/private on deployment-puppetmaster04 and the agent seems happy again (fingers crossed)

TheresNoTime awarded a token. Aug 17 2022, 11:29 AM

In T315350#8161180, @jnuche wrote:

Hi there. JFYI, after the changes early this morning in deployment-puppetmaster04 Puppet started failing on the beta deployment server:

Just to confirm puppet was failing on deployment-deploy03 not on deployment-puppetmaster04?

Aug 17 00:23:40 deployment-deploy03 puppet-agent[13622]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
Aug 17 00:53:51 deployment-deploy03 puppet-agent[15700]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret scap/phabricator_token (file: /etc/puppet/modules/scap/manifests/master.pp, line: 107, column: 22) on node deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud

this should have been fixed by https://gerrit.wikimedia.org/r/c/labs/private/+/823209

I've cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/823211/ into `/var/lib/git/labs/private` on `deployment-puppetmaster04`  and the agent seems happy again (fingers crossed)

This doesn't make much sense; perhaps this is the wrong link? This CR applies to a change to the operations puppet policy, not the private repo. However, I don't think we should need to cherry-pick the private repo change, as it's already been merged.

Just to confirm puppet was failing on deployment-deploy03 not on deployment-puppetmaster04?

Correct, deployment-deploy03 is the beta deployment server. The failures started right after the Puppet changes in deployment-puppetmaster04

this should have been fixed by https://gerrit.wikimedia.org/r/c/labs/private/+/823209

Yes, sorry, I pasted the wrong URL, the relevant patch was 823209 in labs/private. As far as I can tell, /var/lib/git/labs/private is being maintained manually judging by the commits there, there's even quite a few marked with [LOCAL]. Nevertheless, I don't really know how we handle it normally.

Correct, deployment-deploy03 is the beta deployment server. The failures started right after the Puppet changes in deployment-puppetmaster04

Ack, I suspect that could be because ori fixed the merge conflicts, which brought in a bunch more changes.

Yes, sorry, I pasted the wrong URL, the relevant patch was 823209 in labs/private. As far as I can tell, /var/lib/git/labs/private is being maintained manually judging by the commits there, there's even quite a few marked with [LOCAL]. Nevertheless, I don't really know how we handle it normally.

My understanding is that only real secrets should be overlaid on top of the private repo, and fake/mock secrets should be pulled in from labs/private. But I'm not super familiar with deployment-prep myself, so I will leave it for others to confirm/clarify.

Ok, lemme try to quickly summarize what happened and what was done.

Some cloudvirt hosts were accidentally rebooted, which took deployment-prep offline, and it did not come back up by itself.

Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT

Simply restarting apache on the app servers and traffic server on the cache servers did not seem to fix the problem.

TNT noticed errors showing up when running logspam-watch on deployment-mwlog01, see T315379. The fix for those was https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453, which was missing on that host, so apparently puppet was no longer running.

samtar@deployment-puppetmaster04:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Info: Loading facts
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]

I fixed this failure by regenerating the certificates. I followed https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster for that.

Puppet was also not updating for a second reason: a merge conflict between an upstream patch and a local cherry-pick, see P32410. ori fixed that merge conflict; see T315395 for follow-up.

At this point puppet was still not running with the following failure.

zabe@deployment-puppetmaster04:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Info: Loading facts
Error: Could not retrieve catalog from remote server: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)

After some digging, it turned out that apache was not running and it refused to start.

<zabe> AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf
<zabe> Cannot define multiple Listeners on the same IP:port

https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ fixed this. The problem seems to have been introduced somewhere in https://gerrit.wikimedia.org/r/c/operations/puppet/+/797222, https://gerrit.wikimedia.org/r/c/operations/puppet/+/798615 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/798631.
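For anyone hitting the same "Cannot define multiple Listeners" error, the sketch below shows one generic way to find which address or port is declared twice. It is an illustration only, not what was run during this incident; `duplicate_listens` is a hypothetical helper name, and it only catches textually identical `Listen` arguments (so `Listen 80` and `Listen 0.0.0.0:80` would not be matched against each other).

```shell
#!/bin/sh
# Print every Listen argument that appears more than once across the
# Apache config files given as arguments.
# Usage: duplicate_listens /etc/apache2/ports.conf /etc/apache2/conf-enabled/*.conf
duplicate_listens() {
    # Apache directive names are case-insensitive, hence tolower().
    # Commented-out lines are skipped because their first field is not "listen".
    cat "$@" 2>/dev/null \
        | awk 'tolower($1) == "listen" { print $2 }' \
        | sort | uniq -d
}
```

Once the duplicate is removed, `sudo apache2ctl configtest` can confirm that the configuration parses before starting the service again.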

At this stage puppet was finally running on puppetmaster, but it wasn't on deployment-cache-text06.

samtar@deployment-cache-text06:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Trafficserver] is already declared at (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251); cannot redeclare (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251) (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251, column: 5) (file: /etc/puppet/modules/profile/manifests/trafficserver/tls.pp, line: 168) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Ori fixed this by reverting some puppet patches which seem to be incompatible with beta at this stage, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639. The follow-up for this is T315394. They also disabled ::esitest on deployment-prep, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/823766/.
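The three agent failures in this summary (revoked certificate, puppetmaster not listening, catalog compilation error) have distinct signatures in the `run-puppet-agent` output. The sketch below tells them apart by matching strings copied from the logs above; the function name and the category labels are made up for illustration, and the patterns are not an exhaustive list of Puppet errors.

```shell
#!/bin/sh
# Classify puppet agent output (read from stdin) into one of the failure
# modes seen in this incident. Labels are hypothetical; patterns are
# substrings taken from the agent logs quoted in this task.
classify_puppet_failure() {
    log=$(cat)
    case "$log" in
        *"certificate revoked"*)        echo "cert-revoked" ;;
        *"Connection refused"*)         echo "master-unreachable" ;;
        *"Error 500 on SERVER"*)        echo "catalog-compile-error" ;;
        *"Could not retrieve catalog"*) echo "unknown-catalog-failure" ;;
        *)                              echo "no-known-failure" ;;
    esac
}

# Example (on a host where run-puppet-agent exists):
#   sudo run-puppet-agent 2>&1 | classify_puppet_failure
```

The order of the patterns matters: every failing run also prints "Could not retrieve catalog", so the more specific causes are checked first.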

At this point puppet was finally running everywhere, but beta cluster was still unreachable. A simple restart of trafficserver on deployment-cache-text06 fixed that.
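Since the trafficserver systemd unit reported "running" and its journal showed no errors while requests were still failing, the unit state alone was not a reliable signal here. A minimal end-to-end check, sketched under assumptions, could look like the following: the helper names are hypothetical, and treating 503 and 504 like the 502 seen in this task is a choice made for this sketch, not something taken from the incident.

```shell
#!/bin/sh
# Return success (0) if the given HTTP status code suggests the cache layer
# could not reach its next hop. 502 is the code seen in this task; 503 and
# 504 are included as an assumption.
is_gateway_error() {
    case "$1" in
        502|503|504) return 0 ;;
        *)           return 1 ;;
    esac
}

# Print only the HTTP status code for a URL; curl prints 000 when the
# connection itself fails. -m 10 caps the request at ten seconds.
http_status() {
    curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Example:
#   status=$(http_status https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page)
#   if is_gateway_error "$status"; then
#       echo "cache layer returned $status; check trafficserver on the cache host"
#   fi
```

The sketch deliberately stops at reporting; the restart of trafficserver was a manual step in this incident.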

https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_Beta_Cluster_502

RhinosF1 added a project: SRE-OnFire. Aug 17 2022, 1:29 PM

RhinosF1 moved this task from Backlog to Pending Review & Scorecard on the SRE-OnFire board.

Zabe closed subtask T315351: Evaluation Error on deployment-cache-text06 puppet run as Resolved. Aug 17 2022, 2:28 PM

jbond closed subtask T315394: Remove two cherry-picked reverts from deployment-puppetmaster04 as Resolved. Aug 17 2022, 2:46 PM

Change 823766 abandoned by Ori:

[operations/puppet@production] beta cluster: don't instantiate ::esitest

Reason:

I think this is obsoleted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/824146 , and indeed it's no longer cherry-picked on beta.

https://gerrit.wikimedia.org/r/823766

Krinkle renamed this task from "Known, Beta cluster Error: 502, Next Hop Connection Failed" to "Beta cluster Error: 502, Next Hop Connection Failed". Aug 18 2022, 2:54 PM

This does not seem to be related to Search / WDQS, so I'll untag the Search Platform team. Ping us again if you need us!

zeljkofilipin subscribed. Aug 23 2022, 12:25 PM

• AhmadSabriihamzahh92 removed projects: SRE-OnFire, Wikimedia-Incident, Patch-For-Review, Release-Engineering-Team, [DEPRECATED] wdwb-tech, Wikidata, Beta-Cluster-Infrastructure. Aug 31 2022, 6:26 PM

• AhmadSabriihamzahh92 updated the task description.

Zabe added projects: SRE-OnFire, Wikimedia-Incident, Release-Engineering-Team, Beta-Cluster-Infrastructure. Aug 31 2022, 6:38 PM

@Zabe thanks for the detailed summary. It looks like this issue is now resolved, so I'll close this task, but please reopen it if there is still something outstanding.

Restricted Application added a project: User-Ryasmeen. Sep 7 2022, 10:01 AM

• taavi closed subtask T315395: Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04 as Declined. Mar 21 2023, 9:33 AM
