Page MenuHomePhabricator

fourohfour tool returning "replag" as the assigned tool for all 404 urls
Closed, ResolvedPublic

Description

Noticed while I was attempting to move the app from py35 to py37.

Something strange seems to be happening to flask.request.url when the 404 handler logic is applied? I know I tested things like this quite a bit in the past, so it seems likely that this is related to the changes in proxy logic that we have added in the last couple of weeks.

Event Timeline

Possibly caused by T241008: New k8s cluster routing behaving strangely for bd808-test tool? If so, one way to fix it may be to use a custom deployment including an ingress with nginx.ingress.kubernetes.io/rewrite-target: / per the prior behavior.

The tool's $HOME/uwsgi.log is showing all URLs handled as GET /replag/... with the trailing component varying widely and fairly obviously being part of URLs routed to other tools. Something is rewriting all the paths to /replag/ in the proxy layers before calling the fourohfour tool itself to render a 404 page.

bd808 triaged this task as High priority.Dec 29 2019, 2:42 AM

I dumped the k8s ingress nginx config with kubectl exec -it -n ingress-nginx nginx-ingress-64dc7c9c57-2j7qd cat /etc/nginx/nginx.conf on tools-k8s-control-1. Inside it I found:

                location ~* "^/" {

                        set $namespace      "";
                        set $ingress_name   "";
                        set $service_name   "";
                set $service_port   "{0 0 }";
                        set $location_path  "/";
# [...snip...]

                        rewrite "(?i)/" /replag/$2 break;
                        proxy_pass http://upstream_balancer;

                        proxy_redirect                          off;

                }

Now I need to track down where this bogus rule is coming from.

Mentioned in SAL (#wikimedia-cloud) [2019-12-29T03:42:32Z] <bd808> Stop/Start cycle of webservice to see if it changes the active nginx ingress rules (T241525)

Stopping and starting the replag tool changes the config. Now it is showing rewrite "(?i)/" /stashbot/$2 break; in the location ~* "^/" block. My next step is to go look at the upstream config templates to see if they make it more clear why these rules are being generated.

A stop/start cycle of the stashbot webservice has migrated the generated rule to rewrite "(?i)/" /bash/$2 break;. Is it just the oldest ingress that gets chosen for this each time?

Is it just the oldest ingress that gets chosen for this each time?

Here are the ingress objects after the stashbot restart:

# kubectl get ing --all-namespaces
NAMESPACE         NAME         HOSTS               ADDRESS   PORTS   AGE
tool-bash         bash         tools.wmflabs.org             80      6d23h
tool-bd808-test   bd808-test   tools.wmflabs.org             80      126m
tool-fourohfour   fourohfour   tools.wmflabs.org             80      114m
tool-k8s-status   k8s-status   tools.wmflabs.org             80      2d23h
tool-replag       replag       tools.wmflabs.org             80      22m
tool-stashbot     stashbot     tools.wmflabs.org             80      4m32s

If the oldest is picked to set a rewrite rule for the root path, I would expect k8s-status to get picked when I restart the bash tool. Let's find out...

If the oldest is picked to set a rewrite rule for the root path, I would expect k8s-status to get picked when I restart the bash tool. Let's find out...

Confirmed. So at this point my guess at what is happening is:

  • Something in the config or the ingress code is causing the {{ template "SERVER" serverConfig $all $server }} macro expansion for tools.wmflabs.org to end with a location ~* "^/" block
  • This block fails to find expected objects via LUA lookups and instead ends up with uninitialized values for lookups such as {{ $ing := (getIngressInformation $location.Ingress $server.Hostname $location.Path) }}
  • The lookup of {{ $location.ConfigurationSnippet }} in this block gets the data for the oldest ingress in the cluster

We never noticed any of this before because the ingress for the fourohfour tool was the oldest ingress until I restarted it to migrate the service from py35 to py37.

Mentioned in SAL (#wikimedia-cloud) [2019-12-29T04:37:45Z] <bd808> Creating custom ingress to try to fix T241525

Mentioned in SAL (#wikimedia-cloud) [2019-12-29T04:37:45Z] <bd808> Creating custom ingress to try to fix T241525

This seems to have "fixed" things. The custom ingress I created is:

default-route.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  labels:
    name: fourohfour
  name: default-route
  namespace: tool-fourohfour
spec:
  rules:
  - host: tools.wmflabs.org
    http:
      paths:
      - backend:
          serviceName: fourohfour
          servicePort: 8000
        path: /

The resulting configuration is:

                location ~* "^/" {

                        set $namespace      "tool-fourohfour";
                        set $ingress_name   "default-route";
                        set $service_name   "fourohfour";
                set $service_port   "{0 8000 }";
                        set $location_path  "/";
# [...snip...]
                        proxy_pass http://upstream_balancer;

                        proxy_redirect                          off;

                }

What I would still like to understand is if the creation of the location ~* "^/" block is mandatory, or if there is a way to disable that in some other corner of the very, very complex ingress-nginx config system.

bd808 lowered the priority of this task from High to Medium.Dec 29 2019, 4:46 AM
bd808 added subscribers: Bstorm, aborrero.

Lowering priority now that things are working again, but leaving this open for more investigation and discussion. I want to chat with @aborrero and @Bstorm about some WP:BEANS ideas this has given me.