Summary
The AQS service is almost available multi-DC, but a few steps remain before the work can be considered complete.
In the meantime we have a puppet error caused by an incomplete configuration.
We need to address the puppet error now, even though there are still some outstanding questions about the final multi-dc configuration of AQS itself.
Until yesterday puppet was simply reporting a change on every run, but since this patch was merged, puppet now fails to compile, so this needs to be fixed with some urgency.
Detail
The Cassandra cluster that supports the pageviews-based endpoints of AQS is now multi-DC and has been expanded to 12 hosts in each of eqiad and codfw.
This cluster is fully replicated and ready to serve traffic.
However, we also deploy the nodejs-based aqs application to the same hosts as Cassandra, and it is configured to use a standard LVS configuration to load-balance traffic between the 12 available hosts. This LVS configuration has not yet been added, so ever since the aqs20[01-12] servers were commissioned, puppet has been attempting to write an empty realserver_ip into /etc/default/wikimedia-lvs-realserver on every run:
/etc/default/wikimedia-lvs-realserver
-LVS_SERVICE_IPS="10.0.5.3"
+LVS_SERVICE_IPS=""
Puppet then triggered /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver after changing the file, which restored the value to 10.0.5.3, only for it to be changed again on the next run 30 minutes later.
The realserver_ip value was empty because there is no value for an IP address to use in the service catalog:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L152
Adding this entry to the service catalog is one of the steps required in order to make this service available in codfw as well as eqiad, but this step has not been carried out.
I believe that it is safe to add this value; the correct IP address (10.2.1.12) has already been reserved in Netbox: https://netbox.wikimedia.org/ipam/ip-addresses/6921/
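As a rough illustration, the missing catalog entry might look something like the following. This is a hedged sketch only: the key names (ip, eqiad, codfw, default) are assumptions about the general shape of hieradata/common/service.yaml, not a verified schema, and the existing eqiad value is left as a placeholder. The only concrete value is 10.2.1.12, the address reserved in Netbox above.

```yaml
# Illustrative sketch only -- key names are assumed, not taken from the
# actual service.yaml schema.
aqs:
  ip:
    eqiad:
      default: <existing eqiad service IP>   # already present in the catalog
    codfw:
      default: 10.2.1.12                     # reserved in Netbox (see link above)
```

With a codfw address present, the realserver_ip lookup would no longer resolve to an empty string, which should stop puppet from rewriting /etc/default/wikimedia-lvs-realserver on every run.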
Questions
There are still some unanswered questions about whether AQS should be fully multi-DC, due to the location and configuration of its backend data stores.
These questions arise specifically because of our druid-public cluster, which serves the mediawiki_history_reduced_YYYY_MM datasets:
- This cluster is only present in eqiad
- This cluster does not support TLS encryption for requests
Therefore, if we route AQS requests to codfw and they serve any of the mediawiki_history-based endpoints, this will have the following effects:
- Increased latency for these API requests, due to two cross-DC hops
- Requests to and responses from Druid crossing between eqiad and codfw without being encrypted
Possible Solutions
These are not necessarily mutually exclusive, but represent some of the possible ways to address the outstanding questions.
- Find a way of routing only pageviews-based AQS requests to both data centres, whilst serving mediawiki_history-based requests from eqiad only
- Enable TLS for druid-public and accept the increased latency for these cross-data-centre requests
- Deploy a druid-public cluster in codfw, which is arguably the most complete solution for multi-DC capability
- Roll back the partial multi-DC configuration and exclude the cassandra cluster in codfw from serving any AQS traffic