
Figure out why OpenSearch operational scripts frequently fail to connect
Closed, Resolved (Public)

Description

Since we migrated everything to OpenSearch, I've noticed that my operational scripts (such as es-maint-viewer and cirrusssearch_shard_checker) frequently fail to connect to the endpoint (such as search.svc.eqiad.wmnet).

Creating this ticket to:

  • Investigate the failures
  • Make the necessary changes (script updates or otherwise) to restore reliability

Event Timeline

Theories from most to least likely:

  • Bug(s) in the script
  • We added the wrong hosts to the load balancer pool (for example, we put an omega host in a psi pool)
  • One or more of the new hosts' network device is missing a VLAN, similar to the situation described here
bking triaged this task as Medium priority.

Another theory, based on an IRC conversation with @EBernhardson, is that our LVS health checks are not working. For example, cirrussearch2091 has a hardware issue (ref T388610), and yet it still shows as enabled.
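If the health-check theory holds, a quick way to test it would be to compare each host's pooled state against direct reachability. A minimal sketch, assuming a hand-maintained map of hosts to pooled state (the host names and states below are illustrative, not real pool data):

```python
# Hypothetical sketch: flag hosts that are pooled but unreachable.
# Host names and pooled states below are illustrative, not real pool data.
import socket

POOL = {
    "cirrussearch2091.codfw.wmnet": True,  # known hardware issue (T388610)
    "cirrussearch2092.codfw.wmnet": True,
}

def reachable(host: str, port: int = 9243, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, pooled in POOL.items():
    if pooled and not reachable(host):
        print(f"{host} is pooled but unreachable -- health check may be broken")
```

A pooled-but-unreachable host is exactly the signature of a health check that is not depooling failures.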

I think we can rule out bugs in the script; this simple script reproduces the problem:

import requests

for i in range(100):
    requests.get('https://search.svc.eqiad.wmnet:9243', timeout=(0.1,1)).raise_for_status()
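To quantify rather than just reproduce the problem, the loop above can be wrapped to tally failures instead of stopping at the first one. A sketch using the same endpoint and timeouts (failure_rate is a hypothetical helper, not part of the original script):

```python
# Variant of the reproduction script that counts failures instead of
# raising on the first one, to estimate the failure rate.
import requests

def failure_rate(url: str, attempts: int = 100) -> float:
    """Return the fraction of requests that fail to connect or return an error."""
    failures = 0
    for _ in range(attempts):
        try:
            requests.get(url, timeout=(0.1, 1)).raise_for_status()
        except requests.RequestException:
            failures += 1
    return failures / attempts

# Example: failure_rate("https://search.svc.eqiad.wmnet:9243")
```

A consistent non-zero rate over repeated runs would point at a subset of backends (e.g. a particular rack) rather than a transient glitch.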

Within 20-50 requests it typically errors:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 162, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 846, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 315, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 171, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7feef627fcc0>: Failed to establish a new connection: [Errno 113] No route to host

Per IRC conversation with @cmooney , at least some of these errors can be traced back to a missing VLAN on lvs2019. We should probably resurrect T363702 and fix up the script we created to test this in the past.
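A resurrected test script could check each backend directly and classify the failure, since errno 113 (EHOSTUNREACH, "No route to host") specifically suggests a missing route/VLAN rather than a down service. A minimal sketch; the backend host names are placeholders, not the actual pool membership:

```python
# Sketch of a per-backend reachability check. Host names are illustrative.
# Errno 113 (EHOSTUNREACH) points at a routing/VLAN problem rather than
# a service that is merely down (which would refuse or time out instead).
import errno
import socket

BACKENDS = [
    "cirrussearch1100.eqiad.wmnet",  # placeholder host name
    "cirrussearch1101.eqiad.wmnet",  # placeholder host name
]

def check_backend(host: str, port: int = 9243, timeout: float = 1.0) -> str:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as e:
        if e.errno == errno.EHOSTUNREACH:
            return "no route to host (possible missing VLAN)"
        return f"error: {e}"

for host in BACKENDS:
    print(host, check_backend(host))
```

Running something like this from each LVS host would pinpoint which load balancer is missing the VLAN, rather than sampling failures through the VIP.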

Change #1144666 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts

https://gerrit.wikimedia.org/r/1144666

Yeah, I expect it happens whenever the LVS tries to select a back-end in eqiad racks E8 or F8.

These went live recently, and it looks like the vlan interfaces were not created on the LVS. I've prepped a patch which should fix it.

Change #1144666 merged by Cathal Mooney:

[operations/puppet@production] Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts

https://gerrit.wikimedia.org/r/1144666

Mentioned in SAL (#wikimedia-operations) [2025-05-13T00:31:56Z] <sukhe> run agent on A:lvs-eqiad to re-enable puppet: T393911

Change #1145098 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts"

https://gerrit.wikimedia.org/r/1145098

Change #1145099 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts

https://gerrit.wikimedia.org/r/1145099

Change #1145098 abandoned by Vgutierrez:

[operations/puppet@production] Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts"

Reason:

work continuing on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145099

https://gerrit.wikimedia.org/r/1145098

Change #1145099 merged by Vgutierrez:

[operations/puppet@production] lvs: add eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts

https://gerrit.wikimedia.org/r/1145099

Mentioned in SAL (#wikimedia-operations) [2025-05-13T09:00:46Z] <vgutierrez> rolling reboot of eqiad load balancers to add E8/F8 interfaces - T393911 | T382017

@bking @EBernhardson the missing vlans have now been added on lvs1019 (and the other LVS in eqiad).

I'm not sure if the nodes in racks E8/F8 are still pooled; certainly, if I re-try the above test, all 100 requests seem to work.

cmooney@mwmaint1002:~$ python3
Python 3.7.3 (default, Mar 23 2024, 16:12:05) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> 
>>> for i in range(100):
...     requests.get('https://search.svc.eqiad.wmnet:9243', timeout=(0.1,1)).raise_for_status()
... 
>>>
bking changed the task status from Open to In Progress. May 13 2025, 2:48 PM

bking closed this task as Resolved. May 13 2025, 3:57 PM

I can confirm that the changes above have stopped the errors from our LVS pools. Long-term, it seems more productive to move our LVS pools to IPIP rather than worrying about monitoring for L2 connectivity, since the L2 approach will eventually be retired.

As such, I'm closing out this ticket. See T394062 for the IPIP migration plans.