Allow access to wdqs.svc.eqiad.wmnet on port 8888
Open, Normal, Public

Description

As far as I can tell wdqs.svc.eqiad.wmnet will direct me to an active wdqs server 'always'.
https://github.com/wikimedia/puppet/blob/6698ee49e2f04292ba6f8041aed0f524bcf48753/hieradata/role/common/cache/misc.yaml#L135

Port 8888 was opened on the wdqs servers to allow for internal queries to run with a longer timeout T119941
https://github.com/wikimedia/puppet/blob/6698ee49e2f04292ba6f8041aed0f524bcf48753/hieradata/role/common/cache/misc.yaml#L135
https://github.com/wikimedia/puppet/blob/a0b0f48ca009934342e3710e42c6732994c6fbbd/modules/wdqs/manifests/gui.pp#L15
https://github.com/wikimedia/puppet/blob/a0b0f48ca009934342e3710e42c6732994c6fbbd/modules/wdqs/templates/nginx.erb#L80

Would it be possible to also access wdqs.svc.eqiad.wmnet on port 8888?

Allowing this would let me remove the hard-coded reference to an individual machine that was added in https://gerrit.wikimedia.org/r/#/c/380974/

Event Timeline

Addshore created this task. Sep 27 2017, 1:52 PM
Restricted Application added a project: Discovery. Sep 27 2017, 1:52 PM
Restricted Application added a subscriber: Aklapper.
Addshore updated the task description. Sep 27 2017, 1:55 PM
Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

I wonder if it may be more beneficial to use codfw ones for longer tasks, since they are getting less routine traffic now.

BBlack moved this task from Triage to LoadBalancer on the Traffic board. Oct 23 2017, 2:51 PM
ema triaged this task as Normal priority. Nov 9 2017, 7:29 AM

Bump as this is probably trivial but needs the right pair of hands to get it done.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Mar 5 2018, 4:15 PM
elukey added a subscriber: elukey. Jul 2 2018, 12:41 PM
elukey added a comment. Jul 4 2018, 9:09 AM

After a chat with Discovery we ended up refreshing the list of hosts in the Analytics VLAN firewall (that is meant for traffic from the analytics hosts towards production, like stat1005 to wdqs):

https://phabricator.wikimedia.org/T198623#4396997

It seems that it is not possible to whitelist only the VIP IP of wdqs.svc.eqiad.wmnet.

It looks like this was the cause of the dashboard breaking again in T218710.

It is a shame that we cannot whitelist wdqs.svc.eqiad.wmnet. I guess we will just have to keep manually changing which server we point at?
Unless anyone can think of another way?

@Addshore, just saw T218710 and clicked through to here. If you use https://wikitech.wikimedia.org/wiki/HTTP_proxy, you can access wdqs.svc.eqiad.wmnet over HTTP from the analytics VLAN.

Please don't do that. As the page very clearly says, it is there "To allow HTTP requests reach the outside world", not to bypass internal restrictions.

Ah, hm ok.

Actually, @elukey why can't we allow the VIP IP? We did this in T221690: Allow analytics VLAN to reach schema.svc.$site.wmnet, no?

elukey added a subscriber: ayounsi. Edited Jul 9 2019, 3:30 PM

Not really, I wish my past self had added more info. I asked @ayounsi and he couldn't come up with a reason not to, so in theory we could try to modify the term on the firewall and see how it goes. The config is currently:

elukey@re0.cr1-eqiad> show configuration firewall family inet filter analytics-in4 term wdqs
from {
    destination-address {
        /* wdqs1003 */
        10.64.0.14/32;
        /* wdqs1004 */
        10.64.0.17/32;
        /* wdqs1005 */
        10.64.48.46/32;
        /* wdqs2003 */
        10.192.0.29/32;
        /* wdqs2001 */
        10.192.32.148/32;
        /* wdqs2002 */
        10.192.48.65/32;
    }
    protocol tcp;
    destination-port 8888;
}
then accept;

That explicitly whitelists every target host. I recall that there was a reason behind it, but not which one :(

Changed the following: (Cc: @ayounsi )

elukey@re0.cr2-eqiad# show | compare
[edit firewall family inet filter analytics-in4 term wdqs from destination-address]
         10.192.48.65/32 { ... }
+        /* wdqs.svc.eqiad.wmnet */
+        10.2.2.32/32;
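The effect of this change can be sketched as a simple membership check. The following is only an illustrative model of the `wdqs` term, not how the router evaluates filters: the function name `term_matches` is hypothetical, the addresses are copied from the configuration quoted above, and the `protocol tcp` condition is left out for brevity.

```python
import ipaddress

# Destination addresses copied from the "term wdqs" output quoted in this task.
WDQS_HOSTS = [
    "10.64.0.14/32",     # wdqs1003
    "10.64.0.17/32",     # wdqs1004
    "10.64.48.46/32",    # wdqs1005
    "10.192.0.29/32",    # wdqs2003
    "10.192.32.148/32",  # wdqs2001
    "10.192.48.65/32",   # wdqs2002
]
VIP = "10.2.2.32/32"     # wdqs.svc.eqiad.wmnet, added by the change above


def term_matches(dest_ip, dest_port, prefixes, allowed_port=8888):
    """Hypothetical model: accept only if the destination address is in one
    of the listed prefixes and the destination port is the allowed one."""
    addr = ipaddress.ip_address(dest_ip)
    in_list = any(addr in ipaddress.ip_network(p) for p in prefixes)
    return in_list and dest_port == allowed_port


before = term_matches("10.2.2.32", 8888, WDQS_HOSTS)
after = term_matches("10.2.2.32", 8888, WDQS_HOSTS + [VIP])
print("VIP accepted before change:", before)
print("VIP accepted after change:", after)
```

Because every entry is a /32, traffic to the service address was never covered by the per-host entries and needed its own line.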

Now I can see telnet reaching the endpoint from stat1007, but getting connection refused:

elukey@stat1007:~$ telnet wdqs.svc.eqiad.wmnet 8888
Trying 10.2.2.32...
telnet: Unable to connect to remote host: Connection refused

I guess that something more is needed?
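For scripted checks, the telnet test can be approximated with a small TCP probe. This is a minimal sketch using only the Python standard library; the function name `probe` and the three-second timeout are illustrative choices. A "refused" result usually means a reset came back, which suggests the packets are no longer silently dropped on the way but nothing is accepting connections on that address and port, whereas a "timeout" would be more typical of a filter discarding the traffic.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote side (or something in between) answered with a reset.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"


# Example call, to be run from a host inside the analytics VLAN:
# print(probe("wdqs.svc.eqiad.wmnet", 8888))
```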

elukey added subscribers: WMDE-leszek, Ladsgroup. Edited Aug 6 2019, 10:43 AM

Adding @WMDE-leszek and @Ladsgroup since afaics they were/are working on this :)

The idea would be to move all your scripts to the wdqs.svc.eqiad.wmnet 8888 endpoint if possible, and then clean up the explicit single host settings in the analytics firewall. Let me know your thoughts!

Gehel added a comment. Edited Aug 6 2019, 11:59 AM

At the moment, we have a ferm rule to allow access to port 8888 from $DOMAIN_NETWORKS. I think this should be sufficient, but I'm always somewhat lost in our network.

As far as I can see, we don't have an LVS configuration for port 8888, so that needs to be addressed as well.

Side note: since we are expecting heavy queries, we should route those only to the public wdqs endpoint (wdqs.svc.{eqiad|codfw}.wmnet) and NOT to the private cluster (wdqs-internal.svc.{eqiad|codfw}.wmnet).

Change 529053 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/529053

Gehel added a comment. Aug 8 2019, 10:07 AM

A few more comments after discussion with @elukey :

  • the use of port 8888 to get extended query timeouts is exceptional and should only ever be used by analytics (or at least, new use cases need to be vetted)
  • not having this go through LVS makes it fairly explicit that this is a hack and should not be used widely
  • if we add an LVS endpoint, we need to ensure that we have some control over who is accessing it
  • the $ANALYTICS_NETWORK ferm alias could be used, but that's more restrictive than what we have now, so we need to check that no other client is using this port
  • not directly related to this task: we don't have SSL termination on the wdqs servers, everything is in the clear, and we should probably address that at some point

Change 530856 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] wdqs: restrict port 8888 to analytics networks

https://gerrit.wikimedia.org/r/530856

Change 530856 merged by Gehel:
[operations/puppet@production] wdqs: restrict port 8888 to analytics networks

https://gerrit.wikimedia.org/r/530856

Change 529053 merged by Vgutierrez:
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/529053

Change 535520 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/535520

Change 535528 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] wdqs: allow port 8888 for domain networks

https://gerrit.wikimedia.org/r/535528

Change 535528 merged by Vgutierrez:
[operations/puppet@production] wdqs: allow port 8888 for domain networks

https://gerrit.wikimedia.org/r/535528

Change 535520 merged by Vgutierrez:
[operations/puppet@production] lvs: allow access to wdqs lvs on port 8888

https://gerrit.wikimedia.org/r/535520

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:01:59Z] <vgutierrez> restarting pybal on lvs1016 - T176875

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:07:02Z] <vgutierrez> restarting pybal on lvs2006 - T176875

Mentioned in SAL (#wikimedia-operations) [2019-09-12T08:17:01Z] <vgutierrez> restarting pybal on lvs1015 and lvs2003 - T176875

@Addshore @Ladsgroup @WMDE-leszek, can you test that you can reach wdqs.svc.eqiad.wmnet on port 8888? The LVS and other related changes have been merged, so it should work. Thanks

The requests work but the TLS ones give me this error:

ladsgroup@stat1007:~$ curl https://wdqs.svc.eqiad.wmnet:8888
curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol

Our config uses https://wdqs1005.eqiad.wmnet:8888, which also fails with this error. That's weird. We can switch to http:// for now until this gets fixed.

@Ladsgroup there's no TLS termination on that port for now. We should have it, and I will work on it in the near future. Please use HTTP for now.
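As a rough illustration of what "use HTTP for now" means for a script, the sketch below builds a plain `http://` URL for the service address and sends a SPARQL query to it. Only the host and port come from this task. The `/sparql` path, the helper names `build_url` and `run_query`, and the 120-second client timeout are assumptions made for the example and would need to be checked against the actual nginx configuration on port 8888.

```python
import json
import urllib.parse
import urllib.request

HOST = "wdqs.svc.eqiad.wmnet"  # from this task
PORT = 8888                    # from this task
SPARQL_PATH = "/sparql"        # ASSUMPTION: path not confirmed in this task


def build_url(query, host=HOST, port=PORT, path=SPARQL_PATH):
    """Build a plain-HTTP URL; the port has no TLS termination yet,
    so an https:// scheme would fail during the handshake."""
    params = urllib.parse.urlencode({"query": query})
    return f"http://{host}:{port}{path}?{params}"


def run_query(query, timeout=120):
    """Send the query and decode a JSON result (hypothetical helper).

    The timeout is an arbitrary client-side value, not the server limit.
    """
    request = urllib.request.Request(
        build_url(query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = build_url("ASK {}")
print(url)
```

Once TLS termination is added on that port, only the scheme in `build_url` would need to change.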

Change 536143 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] statistics: Use the new wdqs address

https://gerrit.wikimedia.org/r/536143

Change 536144 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[analytics/wmde/toolkit-analyzer@master] Use the new wdqs address

https://gerrit.wikimedia.org/r/536144

Change 536144 merged by jenkins-bot:
[analytics/wmde/toolkit-analyzer@master] Use the new wdqs address

https://gerrit.wikimedia.org/r/536144

Change 536143 merged by Elukey:
[operations/puppet@production] statistics: Use the new wdqs address

https://gerrit.wikimedia.org/r/536143