ATS doesn't support X-Wikimedia-Debug
Closed, Resolved · Public

Description

Our ATS config doesn't seem to support the re-routing of X-Wikimedia-Debug requests to hassium/hassaleh, and for developers on the west coast of the US, the conversion of ulsfo now means their testing workflow is broken.

In theory this looks fairly simple; I think something like:

function do_remap()
    if ts.client_request.get_url_host() == 'appservers-rw.svc.wmnet' and ts.client_request.header['X-Wikimedia-Debug'] then
        ts.client_request.set_url_host('hassium.eqiad.wmnet')
        return TS_LUA_REMAP_DID_REMAP
    end
    return 0
end

... but then there's another couple of layers to this onion:

  1. hassium and hassaleh don't seem to have HTTPS on port 443, just regular old HTTP on port 80 ...
  2. They also don't have a matching discovery-dns to replace the pair of per-DC hostnames

Event Timeline

BBlack triaged this task as High priority. Nov 7 2019, 8:58 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

Reading up on the debug_proxy stuff a bit more: currently hassium/hassaleh are proxies into mwdebug[12]00[12]. They use the header to select the destination host, and also have some backwards compatibility for older values. We could potentially skip/eliminate the debug proxy layer and handle this directly as well. The underlying mwdebug hosts actually do have TLS configured already (like the non-debug appservers).

You can see how the existing setup works by looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/role/manifests/debug_proxy.pp and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/debug_proxy/templates/debug_proxy.nginx.erb . If we're making a TS Lua module for this debug re-routing at all, we may as well just move all of that logic into there?

Maybe this is closer to a Lua replacement for all of it, although it still has issues!

-- Accepted header values (current hostnames plus legacy aliases) and the backend each maps to.
local debug_map = {
    ['1']                       = 'mwdebug1001.eqiad.wmnet',
    ['mwdebug1001.eqiad.wmnet'] = 'mwdebug1001.eqiad.wmnet',
    ['mwdebug1002.eqiad.wmnet'] = 'mwdebug1002.eqiad.wmnet',
    ['mwdebug2001.codfw.wmnet'] = 'mwdebug2001.codfw.wmnet',
    ['mwdebug2002.codfw.wmnet'] = 'mwdebug2002.codfw.wmnet',
    ['mw1017.eqiad.wmnet']      = 'mwdebug1001.eqiad.wmnet',
    ['mw1099.eqiad.wmnet']      = 'mwdebug1002.eqiad.wmnet',
    ['mw2017.codfw.wmnet']      = 'mwdebug2001.codfw.wmnet',
    ['mw2099.codfw.wmnet']      = 'mwdebug2002.codfw.wmnet',
}

function do_remap()
    -- Only requests bound for the MediaWiki appservers are eligible for re-routing.
    if ts.client_request.get_url_host() == 'appservers-rw.svc.wmnet' then
        local xwd = ts.client_request.header['X-Wikimedia-Debug']
        if xwd then
            -- Lua patterns treat '~' as a literal character, so the pattern starts at 'backend='.
            local behost = string.match(xwd, 'backend=(%S+)')
            local mapped_host = debug_map[behost or xwd]
            if mapped_host then
                ts.client_request.set_url_host(mapped_host)
                return TS_LUA_REMAP_DID_REMAP
            else
                -- XXX return 400 error
            end
        end
    end
    return 0
end
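
To make the lookup above easier to try outside ATS, here is a minimal standalone sketch of just the header-to-backend resolution. It is not the code from the change that was eventually merged. The function name `resolve_debug_backend` is hypothetical, and the stricter hostname pattern (letters, digits, dots and hyphens) is an assumption made here so that any trailing punctuation after the hostname would not be captured.

```lua
-- Header values accepted for routing, copied from the map sketched in this task.
local debug_map = {
    ['1']                       = 'mwdebug1001.eqiad.wmnet',
    ['mwdebug1001.eqiad.wmnet'] = 'mwdebug1001.eqiad.wmnet',
    ['mwdebug1002.eqiad.wmnet'] = 'mwdebug1002.eqiad.wmnet',
    ['mwdebug2001.codfw.wmnet'] = 'mwdebug2001.codfw.wmnet',
    ['mwdebug2002.codfw.wmnet'] = 'mwdebug2002.codfw.wmnet',
    ['mw1017.eqiad.wmnet']      = 'mwdebug1001.eqiad.wmnet',
    ['mw1099.eqiad.wmnet']      = 'mwdebug1002.eqiad.wmnet',
    ['mw2017.codfw.wmnet']      = 'mwdebug2001.codfw.wmnet',
    ['mw2099.codfw.wmnet']      = 'mwdebug2002.codfw.wmnet',
}

-- Hypothetical helper: returns the backend hostname for a header value,
-- or nil when the value is missing or not in the map.
function resolve_debug_backend(xwd)
    if type(xwd) ~= 'string' then
        return nil
    end
    -- Assumption: hostnames contain only letters, digits, dots and hyphens.
    local behost = string.match(xwd, 'backend=([%w%.%-]+)')
    -- Fall back to the raw header value to support the legacy bare-hostname form.
    return debug_map[behost or xwd]
end

print(resolve_debug_backend('backend=mwdebug2002.codfw.wmnet'))
print(resolve_debug_backend('mw1017.eqiad.wmnet'))
print(resolve_debug_backend("who knows what I'm doing"))
```

A nil result is the case the `XXX return 400 error` branch would have to handle.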

Change 549840 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: X-Wikimedia-Debug request routing implementation

https://gerrit.wikimedia.org/r/549840

Change 549840 merged by Ema:
[operations/puppet@production] ATS: X-Wikimedia-Debug request routing implementation

https://gerrit.wikimedia.org/r/549840

Mentioned in SAL (#wikimedia-operations) [2019-11-11T09:41:15Z] <ema> test x-wikimedia-debug-routing.lua on cp4027 (depooled) T237687

Change 550103 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: skip the cache if X-Wikimedia-Debug is valid

https://gerrit.wikimedia.org/r/550103

Change 550103 merged by Ema:
[operations/puppet@production] ATS: skip the cache if X-Wikimedia-Debug is valid

https://gerrit.wikimedia.org/r/550103

Mentioned in SAL (#wikimedia-operations) [2019-11-11T10:16:26Z] <ema> repool cp4027 after successful X-Wikimedia-Debug testing P9585 T237687

The functionality is now deployed to production; a brief illustration follows.

Valid XWD header:

$ curl -s -v -H "X-Wikimedia-Debug: backend=mwdebug2002.codfw.wmnet" https://en.wikipedia.org/wiki/Main_Page --resolve en.wikipedia.org:443:198.35.26.96 2>&1 | egrep "< (HTTP/2|server:|x-cache:)"
< HTTP/2 200 
< server: mwdebug2002.codfw.wmnet
< x-cache: cp4031 pass, cp4030 pass

Invalid XWD header, but with proper debug server hostname:

$ curl -s -v -H "X-Wikimedia-Debug: mwdebug2001.codfw.wmnet" https://en.wikipedia.org/wiki/Main_Page --resolve en.wikipedia.org:443:198.35.26.96 2>&1 | egrep "< (HTTP/2|server:|x-cache:)"
< HTTP/2 200 
< server: mwdebug2001.codfw.wmnet
< x-cache: cp4027 pass, cp4030 pass

Invalid XWD:

$ curl -s -v -H "X-Wikimedia-Debug: who knows what I'm doing" https://en.wikipedia.org/wiki/Main_Page --resolve en.wikipedia.org:443:198.35.26.96 2>&1 | egrep "< (HTTP/2|server:|x-cache:)"
< HTTP/2 400 
< server: ATS/8.0.5
< x-cache: cp4031 bug, cp4030 pass

Change 550774 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] ATS: Support X-W-D for /w/api.php as well

https://gerrit.wikimedia.org/r/550774

Change 550774 merged by Ema:
[operations/puppet@production] ATS: Support X-W-D for /w/api.php as well

https://gerrit.wikimedia.org/r/550774

Krinkle reopened this task as Open. Nov 20 2019, 8:26 PM
Krinkle subscribed.

if ts.client_request.get_url_host() == 'appservers-rw.svc.wmnet' then

Looks like this condition may have been lost. Re-opening this for now, though I'm not 100% sure it was this change that caused the issue.

The issue: when X-Wikimedia-Debug is enabled (e.g. via the WikimediaDebug browser extension), I am no longer able to browse https://logstash.wikimedia.org or https://phabricator.wikimedia.org, because requests to those sites are now also affected by the header and result in a 404 error.
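
For illustration, a rough sketch of what the host condition quoted above is meant to guarantee: a request carrying the header is only re-routed when it is already bound for the MediaWiki appservers. Everything here is a stand-in for experimenting under plain Lua, not the deployed script. `make_ts` fakes the ATS `ts` object, `route_debug_request` is a hypothetical name, the `other-service.example` host is made up, the map is trimmed to one entry, and the value given to `TS_LUA_REMAP_DID_REMAP` is a placeholder.

```lua
-- Placeholder for the constant that ATS provides inside its Lua plugin.
local TS_LUA_REMAP_DID_REMAP = 1
local MW_HOST = 'appservers-rw.svc.wmnet'

-- Trimmed map; the full list of accepted values appears earlier in this task.
local debug_map = {
    ['mwdebug1002.eqiad.wmnet'] = 'mwdebug1002.eqiad.wmnet',
}

-- Hypothetical stand-in for the ATS `ts` object, exposing only what the sketch uses.
function make_ts(host, headers)
    local req = { header = headers or {} }
    function req.get_url_host() return host end
    function req.set_url_host(new_host) host = new_host end
    return { client_request = req }
end

-- Hypothetical routing function; takes `ts` as an argument so it can be tested.
function route_debug_request(ts)
    -- Requests for any other origin are left alone, header or not.
    if ts.client_request.get_url_host() ~= MW_HOST then
        return 0
    end
    local xwd = ts.client_request.header['X-Wikimedia-Debug']
    if not xwd then
        return 0
    end
    local behost = string.match(xwd, 'backend=([%w%.%-]+)')
    local mapped_host = debug_map[behost or xwd]
    if not mapped_host then
        return 0 -- a real handler would answer with a 400 here
    end
    ts.client_request.set_url_host(mapped_host)
    return TS_LUA_REMAP_DID_REMAP
end

local hdr = { ['X-Wikimedia-Debug'] = 'backend=mwdebug1002.eqiad.wmnet' }
local wiki = make_ts(MW_HOST, hdr)
local other = make_ts('other-service.example', hdr)
print(route_debug_request(wiki), wiki.client_request.get_url_host())
print(route_debug_request(other), other.client_request.get_url_host())
```

With the host check in place, the second request keeps its original destination even though it carries the same header.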

I suspect that this might actually be due to this Varnish patch applied to support XWD for noc.wikimedia.org: T233768. @Krinkle Can you please try to reproduce and report the X-Cache value of those 404 responses?

ema lowered the priority of this task from High to Medium. Nov 22 2019, 12:59 PM

Change 552507 had a related patch set uploaded (by Krinkle; owner: Ema):
[operations/puppet@production] Revert "vcl: move XWD pass logic to wm_common"

https://gerrit.wikimedia.org/r/552507

Change 552507 merged by Ema:
[operations/puppet@production] Revert "vcl: move XWD pass logic to wm_common"

https://gerrit.wikimedia.org/r/552507

This is now fixed:

$ curl -s -v -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" https://phabricator.wikimedia.org/T237687 2>&1 | egrep "< (HTTP/2|server:)"
< HTTP/2 200 
< server: envoy
$ curl -s -v -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" https://en.wikipedia.org/wiki/Main_Page 2>&1 | egrep "< (HTTP/2|server:)"
< HTTP/2 200 
< server: mwdebug1002.eqiad.wmnet