Page MenuHomePhabricator

Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http'
Closed, ResolvedPublicPRODUCTION ERROR

Description

Steps to replicate the issue (include links if applicable):

  • visit any function on Wikifunctions that doesn't have a connected built in implentation (e.g. Z10000)
  • enter any input(s)
  • click "Run function"
  • wait while the function runs (which seems to be taking an abnormaly long time)

What happens?:
Where the output of the function should be (underneath "Result"), only the text "http" is displayed. The normal "Details" link is not present, nor is the drop down ">" button that is normally next to the output.

What should have happened instead?:
The actual output of the function should have been displayed, along with the afformentioned "Details" link and drop down ">" button.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.): I'm using Firefox version 116.0.3. The issue is also occuring on Edge version 116.0.1938.54. I think (although I may be misremembering) that the last function (without a built in implementation) I was able to succesfully run did return "http" as its (correct) output. That would have been yesterday. Functions with a connected built in implementation work as expected.

Z10237:

image.png (365×242 px, 9 KB)

Z10018:

image.png (322×363 px, 7 KB)

Event Timeline

Jdforrester-WMF renamed this task from Wikifunctions functions that use non-built in implementations return 'http' regardless of the input. to Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http'.Aug 25 2023, 5:38 PM
Jdforrester-WMF triaged this task as Unbreak Now! priority.
Jdforrester-WMF subscribed.

No recent changes to the service (no deploys in a couple of weeks), and it's working fine in staging, but can confirm.

Jdforrester-WMF changed the subtype of this task from "Bug Report" to "Production Error".Aug 25 2023, 5:43 PM

Change 952434 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "wikifunctions: Fix networkpolicies"

https://gerrit.wikimedia.org/r/952434

Change 952484 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Bump evaluator to 2023-08-08-220047

https://gerrit.wikimedia.org/r/952484

Change 952484 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Bump evaluator to 2023-08-08-220047

https://gerrit.wikimedia.org/r/952484

Change 952434 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "wikifunctions: Fix networkpolicies"

https://gerrit.wikimedia.org/r/952434

Change 952435 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions"

https://gerrit.wikimedia.org/r/952435

Change 952435 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions"

https://gerrit.wikimedia.org/r/952435

Change 952436 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Re-apply "wikifunctions: Fix networkpolicies"

https://gerrit.wikimedia.org/r/952436

Change 952437 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions"

https://gerrit.wikimedia.org/r/952437

Unfortunately at this point I'm out of ideas as to what's causing the issue.

@Jdforrester-WMF Hi! I am ignorant about Wikifunctions so please be patient, but I have a question about the reverts: I see that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952435 was reverted (again) since it didn't fix things, but how did you apply it? The admin_ng stuff requires SRE priviledges, not a regular deployment to the target application's namespace. Have you done it before? (curious because I saw only SREs handling that up to now).

On the logs front - in the logstash dashboard for k8s I noticed some occurrences of this error:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2023.08.26?id=aGNkMIoBt3kDdlqnHU9y

Call tuples failed in returnOnFirstError. Error: FetchError: request to https://api-rw.discovery.wmnet/w/api.php?action=wikilambda_fetch&format=json&zids=Z10000 failed, reason: connect ETIMEDOUT 10.2.2.22:443.

I checked with nsenter and the pod cannot reach api-rw, and afaics it is not configured to use the envoy tls proxy. Maybe we are simply missing an egress rule?

Moreover - is there anything in the logs that could explain what is happening? I feel that reverting and re-applying changes may lead to false positive results (see the admin_ng one).

Answering to myself - I see in admin_ng that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952435/ is still to be applied:

kube-system, allow-uncached-api, GlobalNetworkPolicy (crd.projectcalico.org) has changed:
  # Source: calico/templates/globalnetworkpolicy.yaml
  apiVersion: crd.projectcalico.org/v1
  kind: GlobalNetworkPolicy
  metadata:
    name: allow-uncached-api            <<<<===================
    labels:
      helm.sh/chart: calico-0.2.10
      app.kubernetes.io/name: calico
      app.kubernetes.io/instance: calico
      app.kubernetes.io/version: "3.23.3"
      app.kubernetes.io/managed-by: Helm
  spec:
    egress:
    - action: Allow
      destination:
        nets:
        - 10.2.2.22/32
        - 10.2.1.22/32
        ports:
        - 80
        - 443
      protocol: TCP
-   namespaceSelector: has(projectcalico.org/name) && projectcalico.org/name not in {"kube-system",
-     "wikifunctions"}
+   namespaceSelector: has(projectcalico.org/name) && projectcalico.org/name != "kube-system"

Summary:

Reverts merged:

Change 952436 merged by jenkins-bot:

[operations/deployment-charts@master] Re-apply "wikifunctions: Fix networkpolicies"

https://gerrit.wikimedia.org/r/952436

Change 952782 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Fix wikifunctions orchestrator not using the service mesh

https://gerrit.wikimedia.org/r/952782

Change 952437 merged by jenkins-bot:

[operations/deployment-charts@master] Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions"

https://gerrit.wikimedia.org/r/952437

JMeybohm lowered the priority of this task from Unbreak Now! to High.Aug 28 2023, 8:20 AM
JMeybohm subscribed.

Thanks @elukey for stepping in!

I did deploy those changes last week and the tests described in https://wikitech.wikimedia.org/wiki/Wikifunctions/Runbook where okay, so I was confident that I did not break anything. The orchestrator is obviously connecting to mw-api directly, which it is not supposed to do. Those connections need to go through the service mesh.
I drafted a change here https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952782 although I think this won't work as is as you will probably have to provide the correct host header (api-rw.discovery.wmnet) for the calls.

I have temporarily allowed direct access to mw-api again to unblock this.

Tested the use case outlined in the task's description, all works now!

@Jdforrester-WMF it will be extremely useful for the future to improve wikifunction's observability, for instance, there were logs emitted to logstash which logged that specific problem, and would be visible with a better curated dashboard.

wikifunctions logstash

request to https://api-rw.discovery.wmnet/w/api.php?action=wikilambda_fetch&format=json&zids=Z10618 failed, reason: connect ETIMEDOUT 10.2.2.22:443

Thanks @elukey for stepping in!

I did deploy those changes last week and the tests described in https://wikitech.wikimedia.org/wiki/Wikifunctions/Runbook where okay, so I was confident that I did not break anything.

Yeah, sorry about this. In this case the problem was not apparent when making curl requests from the deployment server, but were when coming from the MW cluster; presumably there are extra rights bundled along with those somehow?

The primary service alerting would have spotted this issue, but I'd not written it yet because of other tasks.

The orchestrator is obviously connecting to mw-api directly, which it is not supposed to do. Those connections need to go through the service mesh.
I drafted a change here https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952782 although I think this won't work as is as you will probably have to provide the correct host header (api-rw.discovery.wmnet) for the calls.

I have temporarily allowed direct access to mw-api again to unblock this.

Thanks! Given that we can't test from the command line against staging, do you know if we could make sure that the call is working without a full deploy? We're setting the Host: wikifunctions.org header in the service (and have been since last month); is that sufficient?

@Jdforrester-WMF it will be extremely useful for the future to improve wikifunction's observability, for instance, there were logs emitted to logstash which logged that specific problem, and would be visible with a better curated dashboard.

wikifunctions logstash

request to https://api-rw.discovery.wmnet/w/api.php?action=wikilambda_fetch&format=json&zids=Z10618 failed, reason: connect ETIMEDOUT 10.2.2.22:443

Which specific problem? The link you've provided is showing the ETIMEDOUT in logstash; do you mean a different message? Log it to a different channel?

Thanks @elukey for stepping in!

I did deploy those changes last week and the tests described in https://wikitech.wikimedia.org/wiki/Wikifunctions/Runbook where okay, so I was confident that I did not break anything.

Yeah, sorry about this. In this case the problem was not apparent when making curl requests from the deployment server, but were when coming from the MW cluster; presumably there are extra rights bundled along with those somehow?

No. I think the code is doing something different when called from mediawiki instead of curl. The issue clearly is that with the curl command the orchestrator does not call back to mw-api.

Thanks! Given that we can't test from the command line against staging, do you know if we could make sure that the call is working without a full deploy? We're setting the Host: wikifunctions.org header in the service (and have been since last month); is that sufficient?

As said above I think you should be able to test from the command line against staging (with the correct request).
I would assume setting the header that way works as you already send it to the mw-api discovery endpoint (although I wonder what you have WIKI_VIRTUAL_HOST defined for in the charts environment variables). Please see https://wikitech.wikimedia.org/wiki/Envoy#Use_a_listener on general instructions how to use the service mesh listener. Setting a proper user-agent would also be a good idea I suppose.

Jdforrester-WMF renamed this task from Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' to Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http'.Aug 28 2023, 3:22 PM

Change 952782 merged by jenkins-bot:

[operations/deployment-charts@master] Fix wikifunctions orchestrator not using the service mesh

https://gerrit.wikimedia.org/r/952782

Change 953212 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Re-apply "Fix wikifunctions orchestrator not using the service mesh"

https://gerrit.wikimedia.org/r/953212

Change 953212 merged by jenkins-bot:

[operations/deployment-charts@master] Re-apply "Fix wikifunctions orchestrator not using the service mesh"

https://gerrit.wikimedia.org/r/953212

I re-deployed after we moved to the version of the image with the wmf-certificates package, but it still errored in production but worked in staging.

curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z10000","Z10000K1":"foo","Z10000K2":"bar"},"doValidate":false}' --header "Content-type: application/json"

{"Z1K1":"Z22","Z22K1":"Z24","Z22K2":{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z883","Z883K1":"Z6","Z883K2":"Z1"},"K1":[{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"errors","K2":{"Z1K1":"Z5","Z5K1":"Z507","Z5K2":{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z885","Z885K1":"Z507"},"Z507K1":{"Z1K1":"Z99","Z99K1":{"Z1K1":"Z7","Z7K1":"Z10000","Z10000K1":"foo","Z10000K2":"bar"}},"Z507K2":{"Z1K1":"Z99","Z99K1":{"Z1K1":{"Z1K1":"Z9","Z9K1":"Z5"},"Z5K1":{"Z1K1":"Z9","Z9K1":"Z500"},"Z5K2":{"Z1K1":{"Z1K1":{"Z1K1":"Z9","Z9K1":"Z7"},"Z7K1":{"Z1K1":"Z9","Z9K1":"Z885"},"Z885K1":{"Z1K1":"Z9","Z9K1":"Z500"}},"Z500K1":"Call tuples failed in returnOnFirstError. Error: FetchError: invalid json response body at http://localhost:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z10000 reason: Unexpected token u in JSON at position 0."}}}}}},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationMemoryUsage","K2":"131.94 MiB"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationCpuUsage","K2":"5.903 ms"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationStartTime","K2":"2023-09-25T14:35:20.453Z"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationEndTime","K2":"2023-09-25T14:35:20.708Z"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationDuration","K2":"255 ms"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationHostname","K2":"function-orchestrator-main-orchestrator-84bf7bd68f-fgspt"}]}}

Again, it worked in staging. Clearly there's a config difference in the mesh that means we can't use it to verify before rolling out to user-facing prod.

Change 961099 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Enable/Use service-mesh to reach mw-api

https://gerrit.wikimedia.org/r/961099

Change 961099 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Enable/Use service-mesh to reach mw-api

https://gerrit.wikimedia.org/r/961099

Change 961108 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Switch all clusters to use the service mesh

https://gerrit.wikimedia.org/r/961108

Change 961108 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Switch all clusters to use the service mesh

https://gerrit.wikimedia.org/r/961108

Etonkovidova subscribed.

Checked all examples in the task ( Z10000, Z10018, and Z10237) in wikifunctions wmf.30 - all results displayed correctly.