
Enable PCS to send resource change events to handle URL purges
Closed, Resolved (Public)

Description

For compatibility with restbase, PCS needs to be able to send resource change events itself in order to invalidate upstream caches.
Currently, when restbase detects a change, it re-emits resource change events with fields rewritten to refer to the right service.

For more information:
https://github.com/wikimedia/restbase/blob/master/sys/events.js
https://github.com/wikimedia/restbase/blob/master/v1/pcs/stored_endpoint.js#L71
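
To make the event shape concrete, here is a minimal sketch of building one such event in Node.js. It is not the restbase or mobileapps implementation: `buildResourceChangeEvent` is a hypothetical helper, the field names are copied from the `/resource_change/1.0.0` payloads posted later in this task, and using `randomUUID()` for `meta.id` is an assumption.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: builds one resource_change event for a PCS URI.
// Field names follow the /resource_change/1.0.0 payloads shown in this task.
function buildResourceChangeEvent(domain, path, requestId, stream = 'resource_change') {
  return {
    $schema: '/resource_change/1.0.0',
    meta: {
      request_id: requestId,
      id: randomUUID(), // assumption: any unique id is acceptable here
      dt: new Date().toISOString(),
      domain,
      // The payloads in this task use "<domain><path>" with no scheme.
      uri: `${domain}${path}`,
      stream,
    },
    tags: ['pcs'],
  };
}

const event = buildResourceChangeEvent(
  'en.wikipedia.org',
  '/api/rest_v1/page/mobile-html/Dog',
  randomUUID()
);
console.log(JSON.stringify(event, null, 2));
```

The `stream` parameter defaults to `resource_change`, so the same helper could target another stream if one is configured on the receiving eventgate.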

Event Timeline

Change #1040148 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] caching: Add support to send resource change events

https://gerrit.wikimedia.org/r/1040148

Change #1040148 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] caching: Allow PCS to send resource change events

https://gerrit.wikimedia.org/r/1040148

Change #1047959 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] caching: Send events on purge

https://gerrit.wikimedia.org/r/1047959

Change #1047959 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] caching: Send events on purge

https://gerrit.wikimedia.org/r/1047959

After this patch I was expecting staging to publish events on staging eventgate.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1049530

I am getting timeouts, though. Is there some network configuration needed to allow this traffic?
Here is the error from the HTTP client:

AxiosError: timeout of 120000ms exceeded
    at RedirectableRequest.handleRequestTimeout (/srv/service/node_modules/axios/dist/node/axios.cjs:3143:16)
    at RedirectableRequest.emit (node:events:517:28)
    at Timeout.<anonymous> (/srv/service/node_modules/follow-redirects/index.js:210:12)
    at listOnTimeout (node:internal/timers:569:17)
    at process.processTimers (node:internal/timers:512:7)
    at Axios.request (/srv/service/node_modules/axios/dist/node/axios.cjs:4224:41)

The request is to this URL:

https://staging.svc.eqiad.wmnet:4492/v1/events

For posterity's sake

<claime> nemo-yiannis: I can curl https://staging.svc.eqiad.wmnet:4492/v1/stream-configs from the pod's namespace so I don't think it's networkpolicy related

(and I concur fwiw)

After a lot of debugging, I am still getting timeouts from staging, even when running the same request using Node.
Can you try running this cURL request from the mobileapps staging pod?

jgiannelos@deploy1002:~$ cat data.txt
[
    {
      "$schema": "/resource_change/1.0.0",
      "meta": {
        "request_id": "311902cc-0319-41b0-9fc6-9ca1dcb47dd7",
        "id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
        "dt": "2024-07-01T12:57:01.624Z",
        "domain": "en.wikipedia.org",
        "uri": "en.wikipedia.org/api/rest_v1/page/mobile-html/Dog",
        "stream": "resource_change"
      },
      "tags": [
        "pcs"
      ]
    },
    {
      "$schema": "/resource_change/1.0.0",
      "meta": {
        "request_id": "311902cc-0319-41b0-9fc6-9ca1dcb47dd7",
        "id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
        "dt": "2024-07-01T12:57:01.624Z",
        "domain": "en.wikipedia.org",
        "uri": "en.wikipedia.org/api/rest_v1/page/mobile-html/Dog",
        "stream": "resource_purge"
      },
      "tags": [
        "pcs"
      ]
    }
  ]
jgiannelos@deploy1002:~$ curl https://staging.svc.eqiad.wmnet:4492/v1/events -H "Content-Type: application/json" -d '@data.txt' | jq .

I am running out of ideas about what else could be wrong, and the fact that the requests time out is a bit suspicious.

root@kubestage1003:/home/cgoubert# curl https://staging.svc.eqiad.wmnet:4492/v1/events -H "Content-Type: application/json" -d '@data.txt' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1359  100   549  100   810   4773   7043 --:--:-- --:--:-- --:--:-- 11817
{
  "invalid": [],
  "error": [
    {
      "status": "error",
      "event": {
        "$schema": "/resource_change/1.0.0",
        "meta": {
          "request_id": "311902cc-0319-41b0-9fc6-9ca1dcb47dd7",
          "id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
          "dt": "2024-07-01T12:57:01.624Z",
          "domain": "en.wikipedia.org",
          "uri": "en.wikipedia.org/api/rest_v1/page/mobile-html/Dog",
          "stream": "resource_purge"
        },
        "tags": [
          "pcs"
        ]
      },
      "context": {
        "message": "event 68a24380-37a9-11ef-b9d3-87400a6a8e37 of schema at /resource_change/1.0.0 destined to stream resource_purge is not allowed in stream; resource_purge is not configured."
      }
    }
  ]
}
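
Note that the request itself succeeded from the node: only the `resource_purge` event was rejected, because that stream is not configured. As a rough sketch of how a client could surface this, the hypothetical helper below flattens a response body of this shape into one line per rejected event. The `invalid` and `error` keys and the `event` / `context.message` fields are taken from the output above; treating `invalid` entries as having the same shape as `error` entries is an assumption.

```javascript
// Hypothetical helper: one readable line per event that eventgate rejected.
// Assumes entries under `invalid` look like the entries under `error`.
function summariseFailures(body) {
  const failures = [...(body.invalid || []), ...(body.error || [])];
  return failures.map((f) => {
    const stream = f.event && f.event.meta ? f.event.meta.stream : 'unknown';
    const message = f.context && f.context.message ? f.context.message : 'no message';
    return `${stream}: ${message}`;
  });
}

// Trimmed-down version of the response body shown above.
const sample = {
  invalid: [],
  error: [
    {
      status: 'error',
      event: { meta: { stream: 'resource_purge' } },
      context: { message: 'resource_purge is not configured.' },
    },
  ],
};
console.log(summariseFailures(sample));
```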

Debugging further, I entered the pod and ran openssl s_client -connect staging.svc.eqiad.wmnet:4492, which times out. Hitting the IP directly with openssl s_client -connect 10.64.16.55:4492 returns the eventgate TLS cert.

From inside the pod, a Node.js DNS lookup returns the IPv6 address for staging.svc.eqiad.wmnet:

runuser@mobileapps-staging-6794bbd5c6-cp6gd:/srv/service$ nodejs 
Welcome to Node.js v18.19.0.
Type ".help" for more information.
> const dns = require('node:dns')
undefined
> dns.lookup('staging.svc.eqiad.wmnet', (err, address, family) => {
...   console.log('address: %j family: IPv%s', address, family);
... });
GetAddrInfoReqWrap {
  callback: [Function (anonymous)],
  family: 0,
  hostname: 'staging.svc.eqiad.wmnet',
  oncomplete: [Function: onlookup]
}
> address: "2620:0:861:102:10:64:16:55" family: IPv6
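
The session above only prints the first address the resolver returns. A small sketch for seeing every address, or for asking for IPv4 only, using the `all` and `family` options of `dns.lookup` via `dns.promises`. The helper names `groupByFamily`, `lookupAll` and `lookupIPv4` are hypothetical, and this is a debugging aid rather than the fix that was applied.

```javascript
const dns = require('node:dns');

// Hypothetical helper: splits dns.lookup results into IPv4 and IPv6 lists.
function groupByFamily(addresses) {
  const grouped = { ipv4: [], ipv6: [] };
  for (const { address, family } of addresses) {
    (family === 6 ? grouped.ipv6 : grouped.ipv4).push(address);
  }
  return grouped;
}

// Resolves to every address the resolver returns, grouped by family.
function lookupAll(hostname) {
  return dns.promises.lookup(hostname, { all: true }).then(groupByFamily);
}

// Resolves to a single { address, family } result restricted to IPv4.
function lookupIPv4(hostname) {
  return dns.promises.lookup(hostname, { family: 4 });
}
```

From the pod, `lookupAll('staging.svc.eqiad.wmnet').then(console.log)` would show whether both an A and an AAAA record come back, which helps tell a DNS ordering issue apart from a connectivity one.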

Change #1051688 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] pcs: Connect to eventgate staging using cluster IP

https://gerrit.wikimedia.org/r/1051688

Change #1051688 merged by jenkins-bot:

[operations/deployment-charts@master] pcs: Connect to eventgate staging using cluster IP

https://gerrit.wikimedia.org/r/1051688

Verified on staging:

curl https://staging.svc.eqiad.wmnet:4102/en.wikipedia.org/v1/page/mobile-html/Dog -H "cache-control: no-cache"

Generates resource change events:

% Reached end of topic staging.resource_change [0] at offset 10323
{"$schema":"/resource_change/1.0.0","meta":{"request_id":"0e4f5da8-e55c-4305-a6f3-5c41f1c616f7","id":"0b1f7220-3928-11ef-9b6e-431f5e0e67b3","dt":"2024-07-03T10:36:01.987Z","domain":"en.wikipedia.org","uri":"en.wikipedia.org/api/rest_v1/page/mobile-html/Dog","stream":"resource_change"},"tags":["pcs"]}

For posterity's sake, a summary follows:

  • wikikube staging doesn't have a well-defined, well-trodden path for reaching applications deployed in it.
  • We've CNAMEd staging.svc.eqiad.wmnet to one of the nodes in that cluster (which has just 2 nodes anyway) as a hack to give developers and deployers some ability to test applications.
  • All pods across all Kubernetes clusters in the production realm are dual-stacked, that is, they have both IPv4 and IPv6 connectivity.
  • When AAAA records started being added to all nodes, the above CNAME got an AAAA record too. This didn't hurt anything talking to it from outside the cluster, as nodes do have IPv6 connectivity and no egress filtering.
  • Pods do have egress filtering to the rest of production, and there is no rule allowing them to talk directly to the IPv6 address of the nodes, nor is such a rule desirable.
  • Pods don't have egress filtering for other pods in the same cluster, as we rely on all pods having proper ingress rules.

Given the above, the solution to unblock this was to switch mobileapps from talking to staging.svc.eqiad.wmnet:4492 to talking to eventgate-production-tls-service.eventgate-main.svc.cluster.local.:4492.

For now, this should do, but overall, we should figure out what the role of staging should be in the future and how we can better support such a future.