
Beta Cluster MediaWiki updates require logging-logstash-02.logging.eqiad1.wikimedia.cloud to allow access to port 9200 by `scap`
Open, In Progress, High, Public

Description

Problem

scap sync-world is running into the following error during the canary error rate check phase:

Generic connection error: HTTPConnectionPool(host='logging-logstash-02.logging.eqiad1.wikimedia.cloud', port=9200): Max retries exceeded with url: /logstash-*/_search (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcde8572fa0>: Failed to establish a new connection: [Errno 113] No route to host'))

This started happening around Nov 5, 2025 17:55:12 UTC.

As an alternative, I tried connecting to logging-logstash-03.logging.eqiad1.wikimedia.cloud:9200, but the connection never completes:

dancy@deployment-deploy04:~$ curl logging-logstash-03.logging.eqiad1.wikimedia.cloud:9200
curl: (28) Failed to connect to logging-logstash-03.logging.eqiad1.wikimedia.cloud port 9200: Connection timed out
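
The two failures above differ in kind: `[Errno 113] No route to host` versus a connection timeout. A minimal Python sketch of a TCP probe that reports which failure mode a host shows is given below. The function name `probe` and its labels are illustrative assumptions, not part of scap. As a rule of thumb (not a guarantee), "no route to host" often points to a host that is down or unreachable, while a timeout often points to packets being silently dropped, for example by a firewall rule.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and return a short label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout (curl reports this as error 28).
        return "timed out"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "name resolution failed"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # EHOSTUNREACH is errno 113 on Linux, as seen in the scap error.
            return "no route to host"
        return f"error: {exc}"
```

Run from deployment-deploy04, `probe("logging-logstash-03.logging.eqiad1.wikimedia.cloud", 9200)` would be expected to return `"timed out"` for the curl case shown above.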

Impact

https://integration.wikimedia.org/ci/job/beta-scap-sync-world is broken.

Event Timeline

dancy triaged this task as High priority. Nov 5 2025, 7:22 PM
dancy updated the task description.
dancy updated the task description.

Mentioned in SAL (#wikimedia-releng) [2025-11-05T19:28:26Z] <dancy> Disabled beta-scap-sync-world job (T409339)

https://openstack-browser.toolforge.org/server/logging-logstash-02.logging.eqiad1.wikimedia.cloud reports the logging-logstash-02.logging.eqiad1.wikimedia.cloud instance as SHUTOFF. The Horizon action log there says that @colewhite shut the instance down Nov. 5, 2025, 5:56 p.m.

> As an alternate I tried connecting to logging-logstash-03.logging.eqiad1.wikimedia.cloud:9200 but that never completes:

The logging-logstash-03.logging.eqiad1.wikimedia.cloud instance is missing the "scap-access" security group that is applied to logging-logstash-02. That security group opens up port 9200 access specifically to 172.16.1.63 (deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud) and 172.16.4.233 (NXDOMAIN).

bd808 renamed this task from "Scap can't connect to logging-logstash-02.logging.eqiad1.wikimedia.cloud in beta" to "Beta Cluster MediaWiki updates blocked because scap can't connect to logging-logstash-02.logging.eqiad1.wikimedia.cloud in beta". Nov 5 2025, 9:00 PM

Mentioned in SAL (#wikimedia-releng) [2025-11-05T21:07:17Z] <bd808> Manually triggered beta-scap-sync-world to test T409339

[21:03]  <   cwhite> bd808: I powered it back on to unblock.  Let's look into pointing scap at a newer host. :)
[21:04]  <    bd808> add the needed network ACL to another host and we can do that :)
[21:05]  <    bd808> It might be nice to have a service name to point Beta Cluster at rather than a single instance too.

Leaving this as high for now as the current fix is temporary.

logging-logstash-04 looks to be the newest logstash instance in the logging project (~3 months old). It will need the scap-access security group added to allow deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud to talk to it.

That connection will be specific to those two hosts and will need updating when the deploy server in Beta Cluster is replaced, because the firewall hole is pinned to the IPv4 address 172.16.1.63. The next generation of deployment hosts in Beta Cluster will also have IPv6 connectivity, which will require additional origin host rules. I'm not sure what threat model is being used, but opening port 9200 to all of Cloud VPS would be more future-proof, at the expense of potential access from outside Beta Cluster.
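
The trade-off between pinned and broad origin rules can be sketched with Python's `ipaddress` module. This is only a model of allow-rule matching, not the OpenStack API. `IngressRule`, `is_allowed`, the `172.16.0.0/16` stand-in for "all of Cloud VPS", and the replacement host addresses `172.16.9.9` and `2001:db8::63` are hypothetical. Only the two pinned IPv4 addresses and port 9200 come from this task.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IngressRule:
    """One allow rule: a source network (CIDR) and a destination TCP port."""
    source: str
    port: int


def is_allowed(src_ip: str, port: int, rules: Iterable[IngressRule]) -> bool:
    """Return True if any rule admits traffic from src_ip to the given port."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        net = ipaddress.ip_network(rule.source)
        # An IPv4 rule can never match an IPv6 source, and vice versa.
        if rule.port == port and addr.version == net.version and addr in net:
            return True
    return False


# Pinned rules as described for the "scap-access" security group.
PINNED = [
    IngressRule("172.16.1.63/32", 9200),
    IngressRule("172.16.4.233/32", 9200),
]

# Hypothetical broad rule standing in for "all of Cloud VPS" (assumed range).
BROAD = [IngressRule("172.16.0.0/16", 9200)]

# Current deploy host passes; a hypothetical replacement host does not.
print(is_allowed("172.16.1.63", 9200, PINNED))
print(is_allowed("172.16.9.9", 9200, PINNED))
# The broad IPv4 rule admits the replacement, but not its IPv6 address.
print(is_allowed("172.16.9.9", 9200, BROAD))
print(is_allowed("2001:db8::63", 9200, BROAD))
```

Under these assumptions, a pinned rule set breaks as soon as the deploy host's address changes, and even a broad IPv4 rule needs an IPv6 counterpart once the new hosts connect over IPv6.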

bd808 renamed this task from "Beta Cluster MediaWiki updates blocked because scap can't connect to logging-logstash-02.logging.eqiad1.wikimedia.cloud in beta" to "Beta Cluster MediaWiki updates require logging-logstash-02.logging.eqiad1.wikimedia.cloud to allow access to port 9200 by `scap`". Nov 5 2025, 9:18 PM

Change #1202295 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] scap: use new logging-logstash in deployment-prep

https://gerrit.wikimedia.org/r/1202295

Change #1202295 merged by Cwhite:

[operations/puppet@production] scap: use new logging-logstash in deployment-prep

https://gerrit.wikimedia.org/r/1202295

The network ACLs and the scap config were updated to use the newer host: logging-logstash-04.

I'll try shutting logging-logstash-02 down again tomorrow in hopes that it's ready for decommissioning. Thanks for the heads up!

Mentioned in SAL (#wikimedia-releng) [2025-11-05T21:53:08Z] <bd808> Forced puppet run on deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud to pick up new scap config (T409339)

bd808 changed the task status from Open to In Progress. Nov 5 2025, 9:54 PM
bd808 assigned this task to colewhite.
bd808 moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.