Allow access to wdqs.svc.eqiad.wmnet on port 8888
Closed, Resolved, Public

Description

As far as I can tell wdqs.svc.eqiad.wmnet will direct me to an active wdqs server 'always'.
https://github.com/wikimedia/puppet/blob/6698ee49e2f04292ba6f8041aed0f524bcf48753/hieradata/role/common/cache/misc.yaml#L135

Port 8888 was opened on the wdqs servers to allow for internal queries to run with a longer timeout T119941
https://github.com/wikimedia/puppet/blob/6698ee49e2f04292ba6f8041aed0f524bcf48753/hieradata/role/common/cache/misc.yaml#L135
https://github.com/wikimedia/puppet/blob/a0b0f48ca009934342e3710e42c6732994c6fbbd/modules/wdqs/manifests/gui.pp#L15
https://github.com/wikimedia/puppet/blob/a0b0f48ca009934342e3710e42c6732994c6fbbd/modules/wdqs/templates/nginx.erb#L80

Would it be possible to also access wdqs.svc.eqiad.wmnet on port 8888?

Allowing this would allow me to remove the hard coding of an individual machine added in https://gerrit.wikimedia.org/r/#/c/380974/

Event Timeline

I wonder if it may be more beneficial to use codfw ones for longer tasks, since they are getting less routine traffic now.

Bump as this is probably trivial but needs the right pair of hands to get it done.

After a chat with Discovery we ended up refreshing the list of hosts in the Analytics VLAN firewall (that is meant for traffic from the analytics hosts towards production, like stat1005 to wdqs):

https://phabricator.wikimedia.org/T198623#4396997

It seems that it is not possible to whitelist only the VIP IP wdqs.svc.eqiad.wmnet

It looks like this was the cause of the dashboard breaking again in T218710.

It is a shame that we cannot whitelist wdqs.svc.eqiad.wmnet. I guess we will just have to keep manually changing which server we point at?
Unless anyone can think of another way?

@Addshore, just saw T218710 and clicked through to here. If you use https://wikitech.wikimedia.org/wiki/HTTP_proxy, you can access wdqs.svc.eqiad.wmnet over HTTP from the analytics VLAN.

Please don't do that. As the page very clearly says, it's "To allow HTTP requests reach the outside world", not to bypass internal restrictions.

Not really, I wish my past self had added more info. I asked @ayounsi and he didn't come up with a reason not to, so in theory we could try to modify the term on the firewall and see how it goes. The config is currently:

elukey@re0.cr1-eqiad> show configuration firewall family inet filter analytics-in4 term wdqs
from {
    destination-address {
        /* wdqs1003 */
        10.64.0.14/32;
        /* wdqs1004 */
        10.64.0.17/32;
        /* wdqs1005 */
        10.64.48.46/32;
        /* wdqs2003 */
        10.192.0.29/32;
        /* wdqs2001 */
        10.192.32.148/32;
        /* wdqs2002 */
        10.192.48.65/32;
    }
    protocol tcp;
    destination-port 8888;
}
then accept;

That explicitly whitelists every target host. I recall that there was a reason behind it, but not which one :(
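
To make the effect of that term concrete, here is a minimal Python sketch that models only its `from` clause (destination address, protocol, destination port). The prefixes are copied from the output above; the function name `term_accepts` is made up for illustration, and the rest of the `analytics-in4` filter is not modelled, so this says nothing about what other terms might accept or reject.

```python
import ipaddress

# Destination prefixes copied from the `term wdqs` output in this comment.
WDQS_TERM_DESTINATIONS = {
    "wdqs1003": "10.64.0.14/32",
    "wdqs1004": "10.64.0.17/32",
    "wdqs1005": "10.64.48.46/32",
    "wdqs2003": "10.192.0.29/32",
    "wdqs2001": "10.192.32.148/32",
    "wdqs2002": "10.192.48.65/32",
}
WDQS_TERM_PORT = 8888


def term_accepts(dest_ip: str, dest_port: int, protocol: str = "tcp") -> bool:
    """Return True if a packet matches this one term's `from` clause.

    Hypothetical helper: it only mirrors the three match conditions
    shown above, not the whole firewall filter.
    """
    if protocol != "tcp" or dest_port != WDQS_TERM_PORT:
        return False
    addr = ipaddress.ip_address(dest_ip)
    return any(
        addr in ipaddress.ip_network(prefix)
        for prefix in WDQS_TERM_DESTINATIONS.values()
    )
```

With this list, `term_accepts("10.64.0.14", 8888)` is true, while the service address that wdqs.svc.eqiad.wmnet resolves to (10.2.2.32) does not match any of the /32 entries, which is why the VIP is not reachable through this term as configured.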

Changed the following: (Cc: @ayounsi )

elukey@re0.cr2-eqiad# show | compare
[edit firewall family inet filter analytics-in4 term wdqs from destination-address]
         10.192.48.65/32 { ... }
+        /* wdqs.svc.eqiad.wmnet */
+        10.2.2.32/32;

Now I can see telnet reaching the endpoint from stat1007, but getting connection refused:

elukey@stat1007:~$ telnet wdqs.svc.eqiad.wmnet 8888
Trying 10.2.2.32...
telnet: Unable to connect to remote host: Connection refused

I guess that something more is needed?
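
For anyone repeating this check from a script, a rough Python equivalent of the telnet test is sketched below. The helper name `probe_tcp` and the 3 second default timeout are assumptions for illustration. The distinction it draws is only a hint: "refused" usually means the packet got through and nothing accepted the connection on that address and port, while "timeout" usually means the packet was silently dropped on the way, but a firewall that actively rejects traffic can also produce "refused".

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical helper, roughly what `telnet host port` shows.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A reset came back: reachable, but no listener accepted.
        return "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"
```

From stat1007 this would be called as `probe_tcp("wdqs.svc.eqiad.wmnet", 8888)`; given the telnet output above, the expected classification at this point in the thread is "refused".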

Adding @WMDE-leszek and @Ladsgroup since afaics they were/are working on this :)

The idea would be to move all your scripts to the wdqs.svc.eqiad.wmnet 8888 endpoint if possible, and then clean up the explicit single host settings in the analytics firewall. Let me know your thoughts!

At the moment, we have a ferm rule to allow access to port 8888 from $DOMAIN_NETWORKS. I think this should be sufficient, but I'm always somewhat lost in our network.

As far as I can see, we don't have an LVS configuration for port 8888, so that needs to be addressed as well.

Side note: since we are expecting heavy queries, we should route those only to the public wdqs endpoint (wdqs.svc.{eqiad|codfw}.wmnet) and NOT to the private cluster (wdqs-internal.svc.{eqiad|codfw}.wmnet).

Change 529053 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/529053

A few more comments after discussion with @elukey :

  • the use of port 8888 to get extended query timeouts is exceptional and should only ever be used by analytics (or at least, new use cases need to be vetted)
  • not having this go through LVS makes it fairly explicit that this is a hack and should not be used widely
  • if we add an LVS endpoint, we need to ensure that we have some control over who is accessing it
  • the $ANALYTICS_NETWORK ferm alias could be used, but that's more restrictive than what we have now, so we need to check that no other client is using this port
  • not directly related to this task: we don't have SSL termination on the wdqs servers, everything is in the clear, we should probably address that at some point

Change 530856 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] wdqs: restrict port 8888 to analytics networks

https://gerrit.wikimedia.org/r/530856

Change 530856 merged by Gehel:
[operations/puppet@production] wdqs: restrict port 8888 to analytics networks

https://gerrit.wikimedia.org/r/530856

Change 529053 merged by Vgutierrez:
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/529053

Change 535520 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/535520

Change 535528 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] wdqs: allow port 8888 for domain networks

https://gerrit.wikimedia.org/r/535528

Change 535528 merged by Vgutierrez:
[operations/puppet@production] wdqs: allow port 8888 for domain networks

https://gerrit.wikimedia.org/r/535528

Change 535520 merged by Vgutierrez:
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/535520

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:01:59Z] <vgutierrez> restarting pybal on lvs1016 - T176875

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:07:02Z] <vgutierrez> restarting pybal on lvs2006 - T176875

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:17:01Z] <vgutierrez> restarting pybal on lvs1015 and lvs2003 - T176875

@Addshore @Ladsgroup @WMDE-leszek, can you test that you can reach wdqs.svc.eqiad.wmnet on port 8888? LVS and other appropriate changes have been merged and it should work. Thanks

The requests work but the TLS ones give me this error:

ladsgroup@stat1007:~$ curl https://wdqs.svc.eqiad.wmnet:8888
curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol

Our config uses https://wdqs1005.eqiad.wmnet:8888, which also fails with this error. That's weird. We can switch to http:// for now until this gets fixed.

@Ladsgroup there's no TLS termination on that port for now. We should have it, and I will work on it in the near future. Please use HTTP for now.
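
As a rough illustration of what "use HTTP for now" looks like from a script, here is a minimal Python sketch using only the standard library. The host and port come from this task; the `/sparql` path, the 300 second client timeout, and the function names are assumptions for illustration and should be checked against the actual service configuration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_HOST = "wdqs.svc.eqiad.wmnet"  # from this task
WDQS_PORT = 8888                    # from this task
SPARQL_PATH = "/sparql"             # assumption: adjust to the real path


def build_request(query: str) -> Request:
    """Build a plain-HTTP GET request for a SPARQL query (no TLS)."""
    params = urlencode({"query": query})
    url = f"http://{WDQS_HOST}:{WDQS_PORT}{SPARQL_PATH}?{params}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query: str, timeout: float = 300.0) -> bytes:
    """Send the query and return the raw response body.

    The timeout is a client-side guess; the server-side limit on
    port 8888 is not stated in this task.
    """
    with urlopen(build_request(query), timeout=timeout) as resp:
        return resp.read()
```

Only `run_query` touches the network, so `build_request` can be used on its own to inspect the URL that would be sent.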

Change 536143 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] statistics: Use the new wdqs address

https://gerrit.wikimedia.org/r/536143

Change 536144 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[analytics/wmde/toolkit-analyzer@master] Use the new wdqs address

https://gerrit.wikimedia.org/r/536144

Change 536144 merged by jenkins-bot:
[analytics/wmde/toolkit-analyzer@master] Use the new wdqs address

https://gerrit.wikimedia.org/r/536144

Change 536143 merged by Elukey:
[operations/puppet@production] statistics: Use the new wdqs address

https://gerrit.wikimedia.org/r/536143

Gehel lowered the priority of this task from Medium to Low. Sep 22 2020, 6:57 PM
Gehel claimed this task.

Looks like everything is done, please re-open if I've missed something.