
Define LVS load-balancing for OpenSearch cluster
Closed, ResolvedPublic

Description

We are going to use a three node OpenSearch cluster for DataHub.

These will run as Ganeti VMs, at least for the duration of the MVP.

We will need to use LVS and pybal/conftool to load-balance across these machines and provide high availability.

This ticket tracks the work to create this load-balancer configuration.

Event Timeline

Milimetric moved this task from Next Up to Backlog on the Data-Catalog board.
BTullis triaged this task as High priority.Mar 2 2022, 10:39 AM

Now starting to work on this task. I think the first thing I'll have to do is ask for a service IP to be allocated.

As per instruction from @ayounsi I have also reserved the corresponding address in codfw, in case the service ever becomes available in both DCs.

(Screenshot: the reserved service IP addresses, 604×925 px.)

Change 768663 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add a record for datahubsearch service

https://gerrit.wikimedia.org/r/768663

This is the required DNS change to add the service name: datahubsearch.svc.eqiad.wmnet
https://gerrit.wikimedia.org/r/c/operations/dns/+/768663

This is the required change to start setting up the LVS configuration:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/768668

I have now merged that change and applied it to the datahubsearch servers.
We can see that each server now has the realserver IP address (10.2.2.71) as an alias on the lo interface.

Notice: Applied catalog in 29.09 seconds
btullis@datahubsearch1001:~$ ip a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.2.2.71/32 scope global lo:LVS
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

The service is in the service_setup state, but it is almost ready to move into the lvs_setup state.

Change 769398 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move datahubsearch service from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/769398

Change 769398 merged by Btullis:

[operations/puppet@production] Move datahubsearch service from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/769398

Mentioned in SAL (#wikimedia-operations) [2022-03-09T13:50:59Z] <btullis> restarting pybal on lvs102 T301458

Change 768663 merged by Btullis:

[operations/dns@master] Add a record for datahubsearch service

https://gerrit.wikimedia.org/r/768663

Mentioned in SAL (#wikimedia-operations) [2022-03-09T13:59:11Z] <btullis> restarting pybal on lvs1019 T301458

I'm not sure yet why it's not working.

This is my test against the service address:

btullis@lvs1019:~$ curl http://datahubsearch.svc.eqiad.wmnet:9200/_cat/health
curl: (7) Failed to connect to datahubsearch.svc.eqiad.wmnet port 9200: Connection timed out

but we can talk to the backend servers directly:

btullis@lvs1019:~$ curl http://datahubsearch1001.eqiad.wmnet:9200/_cat/health
1646835058 14:10:58 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%

The realserver IP address (10.2.2.71/32) is present on the loopback interface of the backend hosts:

btullis@datahubsearch1001:~$ ip a sh lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.2.2.71/32 scope global lo:LVS
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

OK, all I had to do was restart the opensearch_1@datahub service on the three datahubsearch servers.
They were already configured with network.host: [_local_,_site_], which makes OpenSearch listen on all local IP addresses at startup, but that setting doesn't pick up addresses added after startup, as happened here.

Now my tests work as expected:

btullis@aqs1010:~$ curl http://datahubsearch.svc.eqiad.wmnet:9200/_cat/health
1646837793 14:56:33 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%

I can put it into the monitoring_setup state next.

Change 769451 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add monitoring for the datahubsearch LVS service

https://gerrit.wikimedia.org/r/769451

Change 769451 merged by Btullis:

[operations/puppet@production] Add monitoring for the datahubsearch LVS service

https://gerrit.wikimedia.org/r/769451

I have moved this to the monitoring_setup state, so the cluster will be monitored by Icinga but will not page. I think this is the best state for it during this phase of MVP development, but I will check whether it is OK to leave it like this for a while.

Change 770471 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the monitoring check for datahubsearch

https://gerrit.wikimedia.org/r/770471

Change 770471 merged by Btullis:

[operations/puppet@production] Update the monitoring check for datahubsearch

https://gerrit.wikimedia.org/r/770471

Change 770475 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add single quotes around the regex to use

https://gerrit.wikimedia.org/r/770475

Change 770475 merged by Btullis:

[operations/puppet@production] Add single quotes around the regex to use

https://gerrit.wikimedia.org/r/770475

The monitoring check in Icinga for this service is now fixed.