Page MenuHomePhabricator

Investigate source of 404 Not Found responses from load.php
Closed, InvalidPublic

Description

Looking at https://grafana.wikimedia.org/dashboard/db/resourceloader we've got a really low rate of error responses.

5xx is 0 for days/weeks at a time. Only exception to this I found was during MW problems/outages not related to RL.

4xx is non-zero, every few minutes a few pop up here and there. This is fine, but it I couldn't quite understand why or what would cause them, given I know of no code in RL that would issue such a response.

Event Timeline

Change 454724 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mtail: Escape the '.' in /w/load.php for varnishrls.mtail

https://gerrit.wikimedia.org/r/454724

Krinkle triaged this task as Low priority.
Krinkle moved this task from Inbox to Confirmed Problem on the MediaWiki-ResourceLoader board.

Change 454724 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: Escape the '.' in /w/load.php for varnishrls.mtail

https://gerrit.wikimedia.org/r/454724

Still happening. Perhaps another url/regex edge case? Will tail the logs to see if I can catch a sample.

@fgiunchedi There's still a few being found in mtail/Prometheus under /w/load.php (404). Before I dig further, can you help confirm that the patch has been applied to all Varnish servers?

Here's a sample I found the log that might be the reason for the 404 errors seen in the graphs.

{
	"hostname":"cp3040.esams.wmnet", ..
	"cache_status":"miss",
	"http_status":"404", ..
	"http_method":"GET",
	"uri_host":"en.wikipedia.org",
	"uri_path":"/w/load.php%3Fdebug=false&lang=en&modules=startup&only=scripts&skin=vector",
	"uri_query":"", ..
	"x_cache":"cp1079 pass, cp3030 miss, cp3040 miss"
}

Looking at cp3040, I see that one of its mtail-related processes was last restarted 8 hours before the patch was merged. But. it does does have recent processes that relate to mtail as well. I'm not sure which one (if any) needs to restart for that patch to be effective:

Aug28   0:00 /bin/bash /usr/local/bin/varnishmtail /etc/mtail

If the patch was indeed the patch and the new regex matches this still, then I'll need to dig further why that is. For example, it might be a url decoding issue.

@Krinkle puppet already DTRT and restarts mtail when its programs change, i.e.

Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) --- /etc/mtail/varnishrls.mtail#0112018-08-28 06:18:18.574055203 +0000
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) +++ /tmp/puppet-file20180828-37866-ehjjuy#0112018-08-28 07:20:50.133272190 +0000
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) @@ -4,7 +4,9 @@
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)  /\thttp_method PURGE\t/ {
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)      # noop
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)  } else {
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) -    /^url \/w\/load.php.*\tcache_status (?P<x_cache>.*)\thttp_status (?P<status>[0-9]{3}).*\tcache_control (-|.*max-age=(?P<maxage>[0-9]+)).*\tinm (?P<inm>[[:print:]]+)\t.*$/ {
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) +    # Match: /w/load.php, /w/load.php?query, /w/load.php/path
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) +    # Do not match: /w/loadXphp,  /w/load.phpXX?query
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) +    /^url \/w\/load\.php([\?\/].*)?\tcache_status (?P<x_cache>.*)\thttp_status (?P<status>[0-9]{3}).*\tcache_control (-|.*max-age=(?P<maxage>[0-9]+)).*\tinm (?P<inm>[[:print:]]+)\t.*$/ {
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)          $inm != "-" {
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)              varnish_resourceloader_inm++
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content)          }
Aug 28 07:20:50 cp3040 puppet-agent[37866]: Computing checksum on file /etc/mtail/varnishrls.mtail
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]) Filebucketed /etc/mtail/varnishrls.mtail to puppet with sum f29963ac9c9b1082f1fa6fcb0565c093
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]/content) content changed '{md5}f29963ac9c9b1082f1fa6fcb0565c093' to '{md5}2c50ef2d955b71ecb9093c3ec4b8d930'
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Profile::Cache::Text/Varnish::Logging::Rls[rls]/Mtail::Program[varnishrls]/File[/etc/mtail/varnishrls.mtail]) Scheduling refresh of Service[varnishmtail]
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (Mtail::Program[varnishrls]) Scheduling refresh of Service[varnishmtail]
Aug 28 07:20:50 cp3040 puppet-agent[37866]: (/Stage[main]/Varnish::Logging/Systemd::Service[varnishmtail]/Service[varnishmtail]) Triggered 'refresh' from 2 events

One case I can see if puppet isn't running on a cp host while the host is receiving traffic, then of course the patch might not be applied.

The other mtail process you saw is varnishmtail-backend though that has simple/basic rules from a separate directory, /etc/varnishmtail-backend

Thanks. That means if the data point I saw and the cited sample are the same, there must be an issue with the regex. But, I suspect they're unrelated.

I was able to find a different sample more recently, that might explain this issue.

{
 "hostname":"cp5007.eqsin.wmnet", ..
 "http_status":"404", ..
 "http_method":"GET",
 "uri_host":"103.102.166.224",
 "uri_path":"/w/load.php",
 "uri_query":"?modules=skins....", ..
 "x_cache":"cp1075 pass, cp2016 pass, cp5012 miss, cp5007 miss"
}

The uri_host IP there is the IP address of text-lb.eqsin.wikimedia.org. The above can be reproduced via http://103.102.166.224/w/load.php?debug=T202479&404_from_app_server_default_virtualhost.

Once I knew that the problem stems from hostnames rather than regex matching of url paths/queries, I was able to find the following ways to trigger it:

  1. Hostnames we route to text-lb that have actual content but aren't wikis (e.g. some of the misc services recently migrated from cache_misc to cache_text).
  2. Hostnames we route to text-lb that Varnish doesn't recognise (receives varnish-generated errorpage with 404, "Domain not served here")
  3. Hostnames we route to text-lb that Varnish accepts but Apache doesn't (receives default VirtualHost from Apache, all paths 404)
  4. IP-as-hostname, which currently appears to make its way to Apaches, and gets 404 there from the default VirtualHost.

Case 1 is hard to avoid, because we'd have to either whitelist wiki domains, or blacklist misc domains, both of which I'd rather avoid in something like mtail.

Case 2 I'm actually unable to reproduce at the moment. Everything I've tried to throw at Varnish made it to Apache.

Case 3 is rare as it is only possible if you manually set a strange hostname header (e.g. via curl), or bypass SSL warnings, because all our domains are HTTPS now and if reaching by non-canonical means, the browser would refuse to connect. For example http://text-lb.eqsin.wikimedia.org gets HTST enforced, and then fails to connect, so this is failing as expected without issue.

Case 4 is surprising. I'd expect that to fall in case 2 and be refused by Varnish with a 404, or at the very least get a TLS redirect, but instead it's going to app servers and getting a regular HTTP response.

cc Traffic as they might help shine some light on what's expected and what isn't

  1. Hostnames we route to text-lb that Varnish doesn't recognise (receives varnish-generated errorpage with 404, "Domain not served here")

That is the case only for cache_upload.

# upload
$ curl -v --resolve test-ema.example.org:80:91.198.174.208 'http://test-ema.example.org/w/load.php?debug=T202478&404_from_app_server_default_virtualhost' 2>&1 | grep Server
< Server: Varnish
# text
$ curl -v --resolve test-ema.example.org:80:91.198.174.192 'http://test-ema.example.org/w/load.php?debug=T202478&404_from_app_server_default_virtualhost' 2>&1 | grep Server
< Server: mw2225.codfw.wmnet

On cache_text we route by default to the appservers all requests not matching any specific entry in cache::req_handling. See the default req_handling entry for cache_text, which is not defined on cache_upload.

  1. Hostnames we route to text-lb that Varnish doesn't recognise (receives varnish-generated errorpage with 404, "Domain not served here")

That is the case only for cache_upload.

Hm. I see. I believe it used to be the case for cache_misc as well. But since it is being merged into cache_text, I guess lack of Varnish handling for unknown domain names is becoming the norm rather than the exception.

It used to be that the only case of non-wiki responses on cache_text (and MW Apaches) were redirect domains, and the www portals. Which were somewhat easy to exclude or ignore.

It would make things easier to reason about if cache_text also discarded unknown/invalid host names and IP-as-host.

Once done, it would unlock the next step, which is to find a way for log processors to differentiate between wiki-domain responses and others. E.g. by adding a property of some sort to the cache_text all for wiki domains. Or perhaps we could expose the canonical name of the Varnish director-route in an internal X-header that log processors would see?

Krinkle changed the task status from Open to Stalled.Dec 10 2018, 2:24 AM
Krinkle removed Krinkle as the assignee of this task.
Krinkle moved this task from Doing (old) to Blocked (old) on the Performance-Team board.
Krinkle added a subscriber: BBlack.

@ema @BBlack This outcome of this task should be for the mtail/varnishrls metric collector to only apply to traffic for hosts where ResourceLoader is available, so as to stop the data pollution from unrelated requests for hosts where there is not meant to be an /w/load.php entry point.

I believe this is blocked on our Varnish configuration having a recognise requests for "real" text domains (eg. not former cache_misc domains, and requests with unknown domains or IPs as Host header).