
Define LVS load-balancing for OpenSearch cluster
Closed, ResolvedPublic

Description

We are going to use a three node OpenSearch cluster for DataHub.

These will run as Ganeti VMs, at least for the duration of the MVP.

We will need to use LVS and pybal/conftool to load-balance across these machines and provide high availability.

This ticket tracks the work to create this load-balancer configuration.

Event Timeline

Milimetric moved this task from Next Up to Backlog on the Data-Catalog board.
BTullis triaged this task as High priority.Mar 2 2022, 10:39 AM

Now starting to work on this task. I think the first thing I'll have to do is ask for a service IP to be allocated.

As per instruction from @ayounsi I have also reserved the corresponding address in codfw, in case the service ever becomes available in both DCs.

(Screenshot: the reserved service IP addresses, 604×925 px.)

Change 768663 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add a record for datahubsearch service

https://gerrit.wikimedia.org/r/768663

This is the required DNS change to add the service name: datahubsearch.svc.eqiad.wmnet
https://gerrit.wikimedia.org/r/c/operations/dns/+/768663

This is the required change to start setting up the LVS configuration:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/768668

I have now merged that change and applied it to the datahubsearch servers.
We can see that each server now has the realserver IP address (10.2.2.71) as an alias on the lo interface.

Notice: Applied catalog in 29.09 seconds
btullis@datahubsearch1001:~$ ip a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.2.2.71/32 scope global lo:LVS
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

The service is in the service_setup state, but it is almost ready to move into the lvs_setup state.

Change 769398 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move datahubsearch service from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/769398

Change 769398 merged by Btullis:

[operations/puppet@production] Move datahubsearch service from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/769398

Mentioned in SAL (#wikimedia-operations) [2022-03-09T13:50:59Z] <btullis> restarting pybal on lvs102 T301458

Change 768663 merged by Btullis:

[operations/dns@master] Add a record for datahubsearch service

https://gerrit.wikimedia.org/r/768663

Mentioned in SAL (#wikimedia-operations) [2022-03-09T13:59:11Z] <btullis> restarting pybal on lvs1019 T301458

I'm not sure yet why it's not working.

This is my test against the service address:

btullis@lvs1019:~$ curl http://datahubsearch.svc.eqiad.wmnet:9200/_cat/health
curl: (7) Failed to connect to datahubsearch.svc.eqiad.wmnet port 9200: Connection timed out

but we can talk to the backend servers directly:

btullis@lvs1019:~$ curl http://datahubsearch1001.eqiad.wmnet:9200/_cat/health
1646835058 14:10:58 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%

The realserver IP address (10.2.2.71/32) is present on the loopback interface of the backend hosts:

btullis@datahubsearch1001:~$ ip a sh lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.2.2.71/32 scope global lo:LVS
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

OK, all I had to do was restart the opensearch_1@datahub service on the three datahubsearch servers.
They were already configured with network.host: [_local_,_site_], which makes OpenSearch listen on all local IP addresses at startup, but that setting doesn't pick up addresses added after startup, as happened here.

Now my tests work as expected:

btullis@aqs1010:~$ curl http://datahubsearch.svc.eqiad.wmnet:9200/_cat/health
1646837793 14:56:33 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%

I can put it into the monitoring_setup state next.

Change 769451 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add monitoring for the datahubsearch LVS service

https://gerrit.wikimedia.org/r/769451

Change 769451 merged by Btullis:

[operations/puppet@production] Add monitoring for the datahubsearch LVS service

https://gerrit.wikimedia.org/r/769451

I have moved this to the monitoring_setup state, so the cluster will be monitored by Icinga but will not page. I think this is the best state for it during this phase of MVP development, but I will check whether it is OK to leave it like this for a while.

Change 770471 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the monitoring check for datahubsearch

https://gerrit.wikimedia.org/r/770471

Change 770471 merged by Btullis:

[operations/puppet@production] Update the monitoring check for datahubsearch

https://gerrit.wikimedia.org/r/770471

Change 770475 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add single quotes around the regex to use

https://gerrit.wikimedia.org/r/770475

Change 770475 merged by Btullis:

[operations/puppet@production] Add single quotes around the regex to use

https://gerrit.wikimedia.org/r/770475

The monitoring check in Icinga for this service is now fixed.