
Finalize the multi-dc configuration of AQS (nodejs) in codfw
Closed, Declined · Public

Description

Summary

The AQS service is almost available multi-DC, but a few steps remain before this can be said to be complete.

In the meantime we have a puppet error that is caused by an incomplete configuration.

We need to address the puppet error now, even though there are still some outstanding questions about the final multi-dc configuration of AQS itself.

Until yesterday puppet was simply reporting a change on every run, but since this patch was merged, puppet now fails to compile and therefore this needs to be fixed with some urgency.

Detail

The Cassandra cluster that supports the pageviews based endpoints of AQS is now multi-DC, having been expanded to 12 hosts in each of eqiad and codfw.
This cluster is fully replicated and ready to serve traffic.

However, we also deploy the nodejs based aqs application to the same hosts as Cassandra, and this is configured to use a standard LVS configuration to load-balance traffic between the 12 available hosts. This LVS configuration has not yet been added in codfw, so ever since the aqs20[01-12] servers were commissioned, puppet has been attempting to write an empty realserver_ip into the file /etc/default/wikimedia-lvs-realserver on every puppet run.

 /etc/default/wikimedia-lvs-realserver
-LVS_SERVICE_IPS="10.0.5.3"
+LVS_SERVICE_IPS=""

Puppet then triggered /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver after changing the file, which restored the value to 10.0.5.3, and the file was changed again 30 minutes later.

The realserver_ip value was empty because there is no value for an IP address to use in the service catalog:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L152

Adding this entry to the service catalog is one of the steps required in order to make this service available in codfw as well as eqiad, but this step has not been carried out.
I believe that it is safe to add this value; the correct IP address (10.2.1.12) has already been reserved in Netbox: https://netbox.wikimedia.org/ipam/ip-addresses/6921/
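For illustration, the addition to the service catalog might take a shape along these lines. This is a hypothetical sketch only: the key names and structure are assumptions modelled on the general form of entries in hieradata/common/service.yaml, and should be checked against the real entries before use.

```yaml
# Hypothetical sketch of the aqs entry with a codfw IP added.
# Key names are illustrative, not the exact service.yaml schema.
aqs:
  port: 7232
  ip:
    eqiad:
      default: 10.2.2.12   # existing: aqs.svc.eqiad.wmnet
    codfw:
      default: 10.2.1.12   # reserved in Netbox, to be added
```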

Questions

There are still some unanswered questions about whether AQS should be fully multi-DC, due to the location and configuration of its backend data stores.

These questions arise specifically because of our druid-public cluster, which serves the mediawiki_history_reduced_YYYY_MM datasets:

  • This cluster is only present in eqiad
  • This cluster does not support TLS encryption for requests

Therefore, if we route AQS requests to codfw and they serve any of the mediawiki_history based endpoints, this will have the following effects:

  • Increased latency of these API requests due to 2 x cross-dc traffic
  • Requests to and responses from Druid crossing between eqiad and codfw without being encrypted

Possible Solutions

These are not necessarily mutually exclusive, but represent some of the possible ways to address the outstanding questions.

  1. Find a way of routing only pageviews based AQS requests to both data centres, whilst serving mediawiki_history based endpoints from eqiad only
  2. Enable TLS for druid-public and accept the increased latency for these cross-data-centre requests
  3. Deploy a druid-public cluster in codfw, which is arguably the most complete solution for multi-DC capability
  4. Roll back the partial multi-DC configuration and exclude the cassandra cluster in codfw from serving any AQS traffic

Event Timeline

BTullis triaged this task as High priority.Mar 3 2023, 11:21 AM

Bringing into the current sprint with high priority, owing to the need to fix the puppet compilation failure one way or another.

We don't necessarily need to solve the more strategic question of whether to route traffic to codfw with such high priority.

As a point of reference, conftool-data for aqs servers in codfw already exists and they are marked as inactive.

btullis@puppetmaster1001:~$ sudo -i confctl --quiet select cluster=aqs get
{"aqs1018.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1020.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1021.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1012.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1013.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1014.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1015.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1017.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs1019.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=aqs,service=aqs"}
{"aqs2010.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2011.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2012.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2002.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2003.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2004.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2006.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2009.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2001.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2005.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2007.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}
{"aqs2008.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=aqs,service=aqs"}

No DNS record for aqs.svc.codfw.wmnet is present.

btullis@puppetmaster1001:~$ host aqs.svc.eqiad.wmnet
aqs.svc.eqiad.wmnet has address 10.2.2.12

btullis@puppetmaster1001:~$ host aqs.svc.codfw.wmnet
Host aqs.svc.codfw.wmnet not found: 3(NXDOMAIN)

No DNS record for service discovery is present either.

btullis@puppetmaster1001:~$ host aqs.discovery.wmnet
Host aqs.discovery.wmnet not found: 3(NXDOMAIN)

Change 894017 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add an entry in the service catalog for the aqs service running in codfw

https://gerrit.wikimedia.org/r/894017

Change 894024 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add forward and reverse entries for aqs.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/894024

Based on a quick ping, the round-trip time from the aqs servers in codfw to the druid-public cluster is about 32ms on a good day.

btullis@aqs2001:/etc/aqs$ ping -c 5 druid-public-broker.svc.eqiad.wmnet
PING druid-public-broker.svc.eqiad.wmnet (10.2.2.38) 56(84) bytes of data.
64 bytes from druid-public-broker.svc.eqiad.wmnet (10.2.2.38): icmp_seq=1 ttl=62 time=31.5 ms
64 bytes from druid-public-broker.svc.eqiad.wmnet (10.2.2.38): icmp_seq=2 ttl=62 time=31.6 ms
64 bytes from druid-public-broker.svc.eqiad.wmnet (10.2.2.38): icmp_seq=3 ttl=62 time=31.6 ms
64 bytes from druid-public-broker.svc.eqiad.wmnet (10.2.2.38): icmp_seq=4 ttl=62 time=31.6 ms
64 bytes from druid-public-broker.svc.eqiad.wmnet (10.2.2.38): icmp_seq=5 ttl=62 time=31.6 ms

--- druid-public-broker.svc.eqiad.wmnet ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 11ms
rtt min/avg/max/mdev = 31.527/31.588/31.625/0.228 ms

So this would be the additional latency for each mediawiki_history based AQS request if it were to be served from codfw.
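As a small sketch, the average RTT can be pulled straight out of ping's summary line; the summary string below is the one captured above, so this just demonstrates the parsing rather than re-running the measurement.

```shell
# Extract the average RTT (ms) from ping's rtt summary line.
# The sample line is the one from the transcript above.
summary='rtt min/avg/max/mdev = 31.527/31.588/31.625/0.228 ms'
avg_ms=$(echo "$summary" | awk -F'[=/]' '{print $6}' | tr -d ' ')
echo "extra latency per cross-DC Druid round trip: ${avg_ms} ms"
```

Each mediawiki_history request served from codfw would pay roughly one such round trip for its Druid call.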

I have had several discussions about this now and the consensus seems to be that:

  • We would definitely need to have TLS for any codfw->eqiad Druid calls, if and when aqs is pooled in codfw
  • A performance penalty of 30 ms for Druid calls from codfw would probably be acceptable

Therefore, I feel that we can proceed with merging https://gerrit.wikimedia.org/r/c/894017 and https://gerrit.wikimedia.org/r/c/894024 which will fix the puppet issue in the short term.

Then, to my mind, the next step should be to test adding optional TLS encryption to Druid. We can start out by enabling this only on the test cluster, then follow it up by adding it to the public and then analytics clusters, in that order.


Having now discussed this with @odimitrijevic, @JAllemandou, @elukey and others, we are happy to look at adding TLS encryption to Druid in order to secure any cross-DC traffic, should there be any.

I'll raise an appropriate ticket and prioritise it for the next sprint.

Having discussed this with @Vgutierrez in #wikimedia-traffic I think we have a plan for how to proceed.

There would be an issue with merging https://gerrit.wikimedia.org/r/c/894017 (the change to the service catalog) unless we can also pool the 12 servers in codfw in conftool.
The reason for this is that we cannot currently have a service in production in one data centre and in service_setup in another data centre.
Therefore our pybal monitoring would generate alerts if the number of pooled servers in codfw were less than the depool_threshold.

However, we believe that it will still be OK to pool these servers. They will not receive any traffic, because restbase directs all aqs traffic to aqs.svc.eqiad.wmnet.

Ref: https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/common/profile/restbase.yaml#1

profile::restbase::aqs_uri: http://aqs.svc.eqiad.wmnet:7232/analytics.wikimedia.org/v1

Restbase doesn't know anything about aqs.svc.codfw.wmnet. In fact, this DNS record hasn't even been created yet. Ref: https://gerrit.wikimedia.org/r/c/operations/dns/+/894024

So, the plan now is to do the following:

  1. Finish adding the aqs.svc.codfw.wmnet DNS record: https://gerrit.wikimedia.org/r/c/operations/dns/+/894024
  2. Update that record in netbox and run the sre.dns.netbox cookbook (as reminded by @Volans)
  3. Pool the 12 new aqs2* servers in codfw using conftool and set their weight to >0
  4. Merge https://gerrit.wikimedia.org/r/c/894017 and deploy, ensuring that it runs on aqs20[01-12] and lvs20[09-10] before proceeding
  5. Restart pybal on lvs2009 and lvs2010 as per the procedure at: https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service

(@Vgutierrez has offered to carry out this step if desired)

  6. Check that no traffic is being received by the realservers on aqs20[01-12] and that AQS responses are not showing any errors.
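Step 3 could be scripted along these lines. This sketch only emits the commands rather than running them; the `set/pooled=yes:weight=10` selector syntax is my recollection of conftool's `confctl select ... set/...` form and should be verified against the conftool documentation before running anything against production.

```shell
# Sketch for step 3: generate (not execute) the conftool commands that
# would pool the 12 codfw aqs hosts with a non-zero weight.
# The confctl syntax here is an assumption; verify before use.
cmds=$(for n in $(seq -w 1 12); do
  echo "sudo confctl select 'name=aqs20${n}.codfw.wmnet' set/pooled=yes:weight=10"
done)
echo "$cmds"
```

Reviewing the emitted commands before piping them to a shell keeps this reversible if the selector syntax turns out to be wrong.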

Change 894024 merged by Btullis:

[operations/dns@master] Add forward and reverse entries for aqs.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/894024

Change 894017 merged by Btullis:

[operations/puppet@production] Add an entry in the service catalog for the aqs service running in codfw

https://gerrit.wikimedia.org/r/894017

Mentioned in SAL (#wikimedia-operations) [2023-03-09T13:03:24Z] <btullis@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-03-09T13:04:26Z] <btullis@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001"

BTullis lowered the priority of this task from High to Medium.Mar 9 2023, 3:09 PM

Removing this unplanned work from the current sprint and lowering the priority of the ticket. The immediate issue of puppet not running on the new aqs20[01-12] nodes has now been resolved.
AQS traffic is not yet being sent to codfw, but most of the configuration is in place so that we could now do so.

In order to do so, we would need to update the restbase configuration here: https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/common/profile/restbase.yaml#1
and add support for the new service address aqs.svc.codfw.wmnet.
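As a sketch of the shape such a change might take (hypothetical; the actual mechanism for active/active routing may well differ), the hiera value could point at a discovery record instead of the eqiad-only service name, following the format of the existing profile::restbase::aqs_uri value:

```yaml
# Hypothetical: switch restbase from the eqiad-only service name to a
# discovery record, once one exists. Not the actual change, just the shape.
profile::restbase::aqs_uri: http://aqs.discovery.wmnet:7232/analytics.wikimedia.org/v1
```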

There is also another child ticket to resolve before we could do that (properly), which is T331631: Add optional TLS encryption to the druid-public-broker. Without the option of using TLS we would be violating a policy by requesting data from Druid in plaintext.

As part of T342213 I kinda jumped the gun and added aqs.discovery.wmnet records - should have checked in on this ticket, hope that's ok!


Thanks. All fine by me. I'm not really sure whether this ticket will ever get done as-is, or whether it will be made obsolete by the AQS 2.0 rollout.

I think we can probably decline this ticket now, given that we are so close to sunsetting AQS 1.0.
@VirginiaPoundstone - would you agree?

BTullis moved this task from Misc to Done on the Data-Platform-SRE board.

This is no longer necessary, since we have migrated all AQS endpoints to AQS 2.0.