
Add druid coordinator service to LVS for the druid_public cluster.
Closed, Resolved, Public

Description

As part of our work to unify all DSE services so that each is reached through a single service URL rather than hardcoded hosts, we need to add the druid-coordinator service to LVS and use a single svc URL for it.

Event Timeline

Change #1198498 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] LVS: etcd data for druid-public-coordinator

https://gerrit.wikimedia.org/r/1198498

Change #1198499 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] LVS: Add druid-public-coordinator to service list

https://gerrit.wikimedia.org/r/1198499

Change #1198500 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/dns@master] DNS: Add druid-public-coordinator record

https://gerrit.wikimedia.org/r/1198500

Change #1199256 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] druid: add druid-coordinator to druid public worker role

https://gerrit.wikimedia.org/r/1199256

@Stevemunene checked with Traffic; this should be deployed today (Oct 29).

Change #1199763 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] LVS: set druid-coordinator to state lvs_setup

https://gerrit.wikimedia.org/r/1199763

Change #1199764 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] LVS: set druid-coordinator to state production

https://gerrit.wikimedia.org/r/1199764

We have had some delays due to scheduling conflicts and PTO. However, we have found some middle ground and now have a slot for the deploy on 5th Nov, anytime between 10:00 GMT and 12:00 GMT.

Change #1198498 merged by Btullis:

[operations/puppet@production] LVS: etcd data for druid-public-coordinator

https://gerrit.wikimedia.org/r/1198498

Change #1198499 merged by Btullis:

[operations/puppet@production] LVS: Add druid-public-coordinator to service list

https://gerrit.wikimedia.org/r/1198499

Change #1199256 merged by Btullis:

[operations/puppet@production] druid: add druid-coordinator to druid public worker role

https://gerrit.wikimedia.org/r/1199256

I've merged the first three patches on this stack (1198498, 1198499, and 1199256).

I'll wait until tomorrow to merge the next one (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199763), which sets the service to state lvs_setup. I'll do this with help from the Traffic team to apply the changes to pybal.

Change #1198500 merged by Btullis:

[operations/dns@master] DNS: Add druid-public-coordinator record

https://gerrit.wikimedia.org/r/1198500

Change #1199763 merged by Ssingh:

[operations/puppet@production] LVS: set druid-coordinator to state lvs_setup

https://gerrit.wikimedia.org/r/1199763

Mentioned in SAL (#wikimedia-operations) [2025-11-20T18:27:00Z] <sukhe> sukhe@lvs1020:~$ sudo systemctl restart pybal.service: T406222

All backing servers have been marked as pooled:

gehel@cumin2002:~$ sudo confctl select service=druid-public-coordinator get
{"druid1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}
{"druid1012.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}
{"druid1013.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}
{"druid1009.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}
{"druid1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}
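For anyone who wants to script this check rather than read the output by eye, here is a minimal Python sketch. It assumes the output format is exactly what is pasted above: one JSON object per line, with the hostname as a key and a separate "tags" key. The function name unpooled_hosts and the SAMPLE constant are hypothetical, not part of confctl.

```python
import json

# Two lines copied from the confctl output above.
SAMPLE = (
    '{"druid1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}\n'
    '{"druid1012.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=druid-public,service=druid-public-coordinator"}\n'
)


def unpooled_hosts(confctl_output):
    """Return the hostnames whose "pooled" value is anything other than "yes"."""
    not_pooled = []
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            # "tags" holds a plain string; every other key is a hostname
            # mapped to its weight/pooled settings.
            if key == "tags" or not isinstance(value, dict):
                continue
            if value.get("pooled") != "yes":
                not_pooled.append(key)
    return not_pooled


print(unpooled_hosts(SAMPLE))  # prints []
```

An empty list means every backend in the pasted output is pooled.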

@ssingh: I think the service is now ready. Could you help us move this to lvs_setup and production state? Is there something else missing from our side?

Change #1216793 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] LVS: set druid-coordinator to state lvs_setup

https://gerrit.wikimedia.org/r/1216793

Change #1216793 merged by Gehel:

[operations/puppet@production] LVS: set druid-coordinator to state lvs_setup

https://gerrit.wikimedia.org/r/1216793

Mentioned in SAL (#wikimedia-operations) [2025-12-09T13:48:54Z] <gehel> sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' - T406222

Mentioned in SAL (#wikimedia-operations) [2025-12-09T13:53:42Z] <gehel> sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' - T406222

This seems to be working: the service responds with an HTTP 307 redirect to one of the Druid nodes:

gehel@cumin1003:~$ curl -v -k http://druid-public-coordinator.svc.eqiad.wmnet:8081
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
*   Trying 10.2.2.15:8081...
* Connected to druid-public-coordinator.svc.eqiad.wmnet (10.2.2.15) port 8081 (#0)
> GET / HTTP/1.1
> Host: druid-public-coordinator.svc.eqiad.wmnet:8081
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 307 Temporary Redirect
< Date: Tue, 09 Dec 2025 13:55:21 GMT
< Location: http://druid1009.eqiad.wmnet:8081/
< Content-Length: 0
< Server: Jetty(9.4.12.v20180830)
< 
* Connection #0 to host druid-public-coordinator.svc.eqiad.wmnet left intact
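The Location header in the response above can also be checked against the backend list, to confirm the redirect stays inside the cluster. This is a small sketch only: the host set is copied from the confctl output earlier in this task, and the function name redirect_targets_backend is hypothetical.

```python
from urllib.parse import urlsplit

# Backends as listed in the confctl output earlier in this task.
POOLED_BACKENDS = {
    "druid1009.eqiad.wmnet",
    "druid1010.eqiad.wmnet",
    "druid1011.eqiad.wmnet",
    "druid1012.eqiad.wmnet",
    "druid1013.eqiad.wmnet",
}


def redirect_targets_backend(location, backends=POOLED_BACKENDS, port=8081):
    """True if the Location URL points at a known backend on the expected port."""
    parts = urlsplit(location)
    return parts.hostname in backends and parts.port == port


# Location header taken from the curl output above.
print(redirect_targets_backend("http://druid1009.eqiad.wmnet:8081/"))  # prints True
```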

Change #1216797 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] LVS: set druid-coordinator to state production

https://gerrit.wikimedia.org/r/1216797

Change #1216797 merged by Gehel:

[operations/puppet@production] LVS: set druid-coordinator to state production

https://gerrit.wikimedia.org/r/1216797

HTTP calls to druid-public-coordinator.svc.eqiad.wmnet:8081 result in an HTTP 307 redirect. I'm assuming that this is expected and that clients will follow those redirects.
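Whether a client follows the redirect depends on the client library, so this is worth verifying per client. As one illustration, here is a minimal sketch using Python's standard urllib, which follows a 307 for GET requests. The helper name final_url is hypothetical, and the call in the comment is only reachable from inside the production network.

```python
import urllib.request


def final_url(url, timeout=5.0):
    """GET url, following any redirects, and return the URL that finally answered."""
    # An empty ProxyHandler ignores proxy environment variables, so the
    # request goes directly to the service address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as resp:
        resp.read()
        return resp.geturl()


# Example call (internal network only):
# final_url("http://druid-public-coordinator.svc.eqiad.wmnet:8081/")
```

One caveat: urllib follows a 307 only for GET and HEAD. A POST that receives a 307 raises HTTPError instead, so clients that write to the coordinator through the service URL may need explicit redirect handling.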