
VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners)
Open, Medium, Public

Description

Various pages on Wikitech document that one can locally emulate external requests to Apache using the Host-header idiom. This makes sense.

For example, at https://wikitech.wikimedia.org/wiki/Application_servers and https://wikitech.wikimedia.org/wiki/Debugging_in_production#Locally. Typically something like:

mwdebug1001$ curl -H 'Host: en.wikipedia.org' "http://localhost/"
...
HTTP/1.1 301 Moved Permanently
Server: mwdebug1001.eqiad.wmnet
Location: https://en.wikipedia.org/wiki/Main_Page

or

mwdebug1001$ curl -H 'Host: en.wikipedia.org' "http://localhost/w/load.php"
...
HTTP/1.1 200 OK
Server: mwdebug1001.eqiad.wmnet
..
.. This file is the entry point for ResourceLoader ..

However, as of writing, this is not working. Instead, virtually any attempted URL yields a 404 Not Found.

404 Not Found
mwdebug1002:~$ curl -v -H 'Host: test.wikipedia.org' "http://localhost/w/load.php"
* Hostname was NOT found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 80 (#0)
> GET /w/load.php HTTP/1.1
> User-Agent: curl/7.38.0
> Accept: */*
> Host: test.wikipedia.org
> 
< HTTP/1.1 404 Not Found
< Date: Mon, 19 Mar 2018 23:43:16 GMT
* Server mwdebug1002.eqiad.wmnet is not blacklisted
< Server: mwdebug1002.eqiad.wmnet
< Content-Length: 327
< Content-Type: text/html; charset=iso-8859-1
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /w/load.php was not found on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>
* Connection #0 to host localhost left intact

This has broken before at various points over the past few years, and each time we found a workaround. At one point, I recall, it was important (for forgotten reasons, something about HTTPS) to leave the URL unchanged and instead swap the TCP destination via DNS, as follows:

$ curl -v --resolve 'test.wikipedia.org:80:127.0.0.1' "http://test.wikipedia.org/w/load.php"

But this doesn't work either.

At another point, it was important to include -H 'X-Forwarded-Proto: https'. I think that's still the case for some things, but at the Apache level most things now support both, with Vary.

I've tried many different variations; none work.

  • curl -v -H 'Host: test.wikipedia.org' "http://localhost/w/load.php" (plain, with -6, with -g, with -g6)
  • curl -v -H 'Host: test.wikipedia.org' "http://127.0.0.1/w/load.php"
  • curl -v -H 'Host: test.wikipedia.org' "http://[::1]/w/load.php" (plain, with -6, with -g, with -g6)
  • curl -v --resolve 'test.wikipedia.org:80:127.0.0.1' "http://test.wikipedia.org/w/load.php"
  • curl -v --resolve 'test.wikipedia.org:80:::1' "http://test.wikipedia.org/w/load.php" (plain, with -6, with -g, with -g6)

Eventually, I tried it from a different host to see if that would work. And to my surprise, it did:

mwdebug1002$ curl -v -H 'Host: test.wikipedia.org' "http://mwdebug1001.eqiad.wmnet/w/load.php"
..
HTTP/1.1 200 OK
Server: mwdebug1001.eqiad.wmnet
..

It also works from mwdebug1001 itself, and it works when using mwdebug's local 10.x IP address. All of these work:

  • mwdebug1002$ curl -v -H 'Host: test.wikipedia.org' "http://mwdebug1001.eqiad.wmnet/w/load.php"
  • mwdebug1001$ curl -v -H 'Host: test.wikipedia.org' "http://mwdebug1001.eqiad.wmnet/w/load.php"
  • mwdebug1001$ curl -v -H 'Host: test.wikipedia.org' "http://10.64.32.123.eqiad.wmnet/w/load.php"
  • mwdebug1001$ curl -v --resolve 'test.wikipedia.org:80:10.64.32.123' "http://test.wikipedia.org/w/load.php"

The first thing that came to mind at this point is that maybe something is doing the opposite of Require local and denying locally initiated connections to all production sites. However, even if such a thing existed, two observations contradict it:

  1. Locally initiated connections were still possible when using the local IP.
  2. It responds with our custom default VirtualHost, not with an error page.

This last point is important. Removing the path component of the URL to reveal the document root shows that the request does actually match one of our VirtualHost configurations, just not the one it is supposed to:

mwdebug1002:~$ curl -v -H 'Host: test.wikipedia.org' "http://localhost/"
* Connected to localhost (::1) port 80 (#0)
> Host: test.wikipedia.org
>  [..]
< HTTP/1.1 200 OK [..]
< Server: mwdebug1002.eqiad.wmnet [..]
< 
<!DOCTYPE html>
<html lang=en>
<meta charset="utf-8">
<title>Unconfigured domain</title>
<link rel="shortcut icon" href="//wikimediafoundation.org/favicon.ico">
<style> [..]

This is /srv/mediawiki/docroot/default/index.html as configured by puppet:/modules/mediawiki/files/apache/sites/nonexistent.conf.
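For reference, the catch-all default site is along these lines (a sketch with assumed directive details, not the exact contents of nonexistent.conf):

```apache
# Sketch (assumed): a last-resort vhost that serves the "Unconfigured
# domain" page for any Host header that no other vhost claimed.
<VirtualHost *:80>
    ServerName default
    DocumentRoot /srv/mediawiki/docroot/default
</VirtualHost>
```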

So how come it is matching that one but not the main ones?

Event Timeline

Further testing shows that while it matches the custom default VirtualHost on mwdebug1001 and mwdebug1002, it behaves differently on a pooled app server (e.g. mw1299). There it responds with the Debian default page:

mw1299:~$ curl -v -H 'Host: test.wikipedia.org' "http://localhost/"
* Connected to ::1 (::1) port 80 (#0) [..]
> Host: www.mediawiki.org
>  [..]
< Server: Apache [..]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <title>Apache2 Debian Default Page: It works</title>

I still don't know why the fallback happens, why that fallback varies between mwdebug and app servers, or why there is no error message. But I did find the culprit: the server-status configuration. Since the resolution of T113090 with https://gerrit.wikimedia.org/r/#/c/239998/, a new VirtualHost is registered by configuration:

<VirtualHost 127.0.0.1:80>
  ServerName localhost
  ServerAlias 127.0.0.1
  <Location /server-status>
    [..] Require local

Removing this made it all work again.

mwdebug1001$ sudo rm /etc/apache2/conf-enabled/50-server-status.conf
mwdebug1001$ sudo /etc/init.d/apache2 restart

It seems Apache considers the connection's IP address more important than the Host header when selecting a VirtualHost. And it even appears to apply the 127.0.0.1 VirtualHost to connections on ::1 (IPv6) as well.
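The mechanics, as I understand them (a sketch with assumed paths, not the production config): Apache first narrows the candidate VirtualHosts by the address/port in each <VirtualHost> directive, preferring the most specific address match, and only then consults the Host header among those candidates. A vhost bound to the loopback address therefore shadows every name-based vhost for local connections:

```apache
# Sketch: with both of these loaded, a request to http://localhost/ is
# matched by address first. 127.0.0.1:80 is more specific than *:80, so
# the loopback vhost wins regardless of the Host header, and
# "Host: test.wikipedia.org" never reaches the name-based vhost below.
<VirtualHost 127.0.0.1:80>
    ServerName localhost
    <Location /server-status>
        SetHandler server-status
        Require local
    </Location>
</VirtualHost>

<VirtualHost *:80>
    ServerName test.wikipedia.org
    DocumentRoot /srv/mediawiki/docroot/wikipedia.org
</VirtualHost>
```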

I've documented a workaround at https://wikitech.wikimedia.org/wiki/Debugging_in_production#Locally: use the LAN address rather than the localhost address. E.g.

$ curl -v -H 'Host: test.wikipedia.org' "http://$(hostname -i)/"

Forgot to say: the aforementioned workaround is not actually a workaround (sorry). The hostname -i hack "works" in the sense that it ends up routing to the MediaWiki VirtualHost, which is good, but for my debugging to work, I need the application itself (MediaWiki) to see that the request comes from the local 127.0.0.1 or ::1 interface, to allow additional debug actions (such as dumping stuff to /tmp).

Those actions are restricted by REMOTE_ADDR, and connecting to Apache using the 10.x address (even from the same server) does not make REMOTE_ADDR the 127.0.0.1 or ::1 address. That makes sense, but it also means I'm still blocked :)
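To illustrate the constraint (a hypothetical sketch of the kind of gate involved, not MediaWiki's actual code): REMOTE_ADDR reflects the interface the TCP connection arrived on, so connecting via the 10.x address can never satisfy a loopback-only check, even from the same machine.

```python
def loopback_only(remote_addr: str) -> bool:
    """Hypothetical gate: allow extra debug actions only for requests
    whose REMOTE_ADDR shows they arrived on the loopback interface."""
    return remote_addr in ("127.0.0.1", "::1")

# curl http://localhost/...       -> REMOTE_ADDR is 127.0.0.1 or ::1
print(loopback_only("::1"))           # True
# curl http://$(hostname -i)/...  -> REMOTE_ADDR is the LAN address,
# even when the connection originates on the same server.
print(loopback_only("10.64.32.123"))  # False
```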

I investigated the part about ".. on mwdebug1001 and mwdebug1002, .. behaves differently on a pooled app server (e.g. mw1299)" a bit.

I found that a canary appserver like mw1261 and mwdebug1001 have identical apache config, but mw1299 does NOT.

/etc/apache2/apache2.conf differs between these hosts. Among the differences:

mw1299 does: IncludeOptional conf-enabled/*.conf
mw1261 does: Include conf-enabled/*.conf
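For what it's worth, those two directives only differ in how they treat a wildcard that matches nothing (Apache 2.4 semantics), so assuming conf-enabled/ is non-empty on both hosts, the routing difference must come from the rest of the file contents rather than this directive alone:

```apache
# Include: fails at startup if the wildcard pattern matches no files.
Include conf-enabled/*.conf

# IncludeOptional: silently skipped if the wildcard matches nothing.
IncludeOptional conf-enabled/*.conf
```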

The version of apache2.conf that canaries and mwdebug have matches the puppet repo template:

mediawiki/templates/apache/apache2.conf.erb

The version on pooled appserver mw1299 is different.

The template above gets installed by the class mediawiki::web, which gets included in role/manifests/mediawiki/webserver.pp.

mw1299 is special because it's a jobrunner, using role(mediawiki::jobrunner). That role does not include mediawiki::webserver, unlike the others.

This should explain why mw1299 is different.

Try testing with a pooled regular appserver that is not a jobrunner, like mw1267.eqiad.wmnet (role(mediawiki::appserver)).

RobH triaged this task as Medium priority.May 1 2018, 3:18 PM

Mentioned in SAL (#wikimedia-operations) [2020-05-29T07:11:57Z] <mutante> mw1293 (canary jobrunner ) replace apache2.conf with version from mwdebug1001, restart apache, to debug for T190111

Just confirmed this is still the case. mw1299, as a jobrunner, behaves differently from mwdebug1001 and mw1267.

I copied the apache2.conf from mwdebug1001 over to mw1293, a canary jobrunner, and could confirm that curl -v -H 'Host: test.wikipedia.org' "http://localhost/" then shows the same result as on mwdebug1001.

@Krinkle Is this still an issue once you know that jobrunners are unlike other regular appservers?

@Joe Should we just install the apache2.conf from mediawiki/apache/apache2.conf.erb on jobrunners as well, just like we do on other appservers? Not by including the entire profile::mediawiki::httpd, but just that one file as it is defined there. That would do it.

Dzahn renamed this task from VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost to VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners).May 29 2020, 7:17 AM

Change 599683 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mediawiki: also include the MW apache2.conf on jobrunners

https://gerrit.wikimedia.org/r/599683

My take on this is that we should rather move jobrunners to use a normal endpoint and thus the same configuration as all other appservers. @hnowlan did start working on this, but it stalled because I had no time to assist in the transition.

We will pick that up at a later time.

Change 599683 abandoned by Dzahn:
[operations/puppet@production] mediawiki: also include the MW apache2.conf on jobrunners

Reason:

https://gerrit.wikimedia.org/r/599683

This is causing an issue with our mailman3 puppetization, where the mailman3 service wants to talk to hyperkitty (archiver) over localhost.

My understanding of our current apache configuration is that the problematic part is:

ServerName localhost
ServerAlias 127.0.0.1
ServerAlias ::1

which makes requests to localhost hit the virtualhost that only has a route for /server-status.

My proposal is to use a different name, like ServerName apache2-status. Then to hit the endpoint you need to do something like curl -H "Host: apache2-status" http://localhost/server-status, which puts the onus for doing something different on this endpoint rather than on all requests to localhost.
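A sketch of what that might look like (assuming the status vhost is declared on *:80 so that name-based matching applies; otherwise the address match, not the name, decides):

```apache
<VirtualHost *:80>
    # Renamed from "localhost": loopback requests carrying a wiki Host
    # header now fall through to the name-based wiki vhosts instead of
    # being captured by this one.
    ServerName apache2-status
    <Location /server-status>
        SetHandler server-status
        Require local
    </Location>
</VirtualHost>
```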

Problem: prometheus-apache-exporter only learned how to use a Host header in 0.6.0: https://github.com/Lusitaniae/apache_exporter/commit/2cb6f0d60d74557c5374e91891b25187716ea9a1 so we'd need to backport or upgrade.

Alternative suggestion: just put the endpoint behind HTTP Basic Auth, and put the credentials in a .netrc readable to all shell users (and in the prometheus URL and anything else that calls it).

> This is causing an issue with our mailman3 puppetization, where the mailman3 service wants to talk to hyperkitty (archiver) over localhost.
>
> My understanding of our current apache configuration is that the problematic part is:
>
>   ServerName localhost
>   ServerAlias 127.0.0.1
>   ServerAlias ::1
>
> which makes requests to localhost hit the virtualhost that only has a route for /server-status.
>
> My proposal is to use a different name like ServerName apache2-status. Then to hit the endpoint, you need to do something like curl -H "Host: apache2-status" http://localhost/server-status, which puts the onus for doing something different on this endpoint rather than all requests to localhost.

IMHO this is an acceptable solution.

> Problem: prometheus-apache-exporter only learned how to use a Host header in 0.6.0: https://github.com/Lusitaniae/apache_exporter/commit/2cb6f0d60d74557c5374e91891b25187716ea9a1 so we'd need to backport or upgrade.

Indeed, we'll also get there with Bullseye (0.8.0)

> Alternative suggestion: just put the endpoint behind HTTP Basic Auth, and put the credentials in a .netrc readable to all shell users (and in the prometheus URL and anything else that calls it).

To me this seems to have obvious disadvantages vs matching on Host. Also, can mailman3 use a specific Host?