Page MenuHomePhabricator

[Event Platform] change propagation should discard canary events
Closed, ResolvedPublic

Description

T266798 will enable canary events for all streams. Any change prop rules that listen to MW state change event stream topics should add rules to discard events where meta.domain == 'canary'.

Adding a match_not condition to the rules should suffice, e.g.

match_not:
  - meta:
      domain: canary

It proved easier to add a config switch to have code automatically add this match_not condition to the rule.
https://gerrit.wikimedia.org/r/974267

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

On investigation of the config for changeprop and for changeprop-jobqueue, it might be easier to build canary event filtering feature into change-propagation code itself, rather than copy/pasted into all rules.

Ahoelzl renamed this task from change propagation should discard canary events to [Event Platform] change propagation should discard canary events.Nov 14 2023, 7:21 PM
Ahoelzl edited projects, added Data-Engineering; removed Data-Engineering (Sprint 5).

There seems to be an undocumented cases rule feature which makes it not obvious where to put this match_not rule. Some 'rules' use top level match_not, some uses cases. The seem to be exclusive.

Change 974267 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/services/change-propagation@master] Add discard_canary_events rule option

https://gerrit.wikimedia.org/r/974267

it might be easier to build canary event filtering feature into change-propagation code itself

https://gerrit.wikimedia.org/r/974267

Change 974267 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Add discard_canary_events rule option

https://gerrit.wikimedia.org/r/974267

After merging, I am deploying the new docker image to beta by editing Puppet Configuration for deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud and setting the changeprop.version to version: 2023-11-20-162131-production.

I had to do the same for deployment-docker-cpjobqueue01.

Then I log into those nodes and run puppet via 'sudo puppet agent -t'.

This restarted changeprop and cpjobqueue with the latest docker image that includes auto canary event filtering.

Change 975862 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] changeprop - remove prometheus metrics config for beta values

https://gerrit.wikimedia.org/r/975862

Ah, this failed. I believe I need to run a more up to date docker host node? Trying with bookworm.

I was successful in moving changeprop to a new cloud vps node. I updated the docs at https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep

I was the able to verify that a canary event was skipped, while non canary events were processed.

I have deleted the old deployment-docker-changeprop01 node. cc @elukey

I was not (yet) successful in moving cpjobqueue to a new Cloud VPS node.

I tried making a new deployment-cpjobqueue-1 node to keep things separate. Getting the same error again:

{
  "name": "changeprop",
  "hostname": "d76c60b2411a",
  "pid": 1,
  "level": "FATAL",
  "err": {
    "message": "Unexpected token ,",
    "name": "SyntaxError",
    "stack": "SyntaxError: Unexpected token ,\n    at new Function (<anonymous>)\n    at TAssembly.compile (/srv/service/node_modules/tassembly/tassembly.js:526:11)\n    at new Template (/srv/service/node_modules/swagger-router/lib/reqTemplate.js:413:34)\n    at Router._expandOptions (/srv/service/node_modules/hyperswitch/lib/router.js:119:23)\n    at Router._loadModule (/srv/service/node_modules/hyperswitch/lib/router.js:198:30)\n    at /srv/service/node_modules/hyperswitch/lib/router.js:298:25\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Object.gotValue (/srv/service/node_modules/bluebird/js/release/reduce.js:166:18)\n    at Object.gotAccum (/srv/service/node_modules/bluebird/js/release/reduce.js:155:25)\n    at Object.tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromiseCtx (/srv/service/node_modules/bluebird/js/release/promise.js:641:10)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:97:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)",
    "spec": {
      "metadata_broker_list": [
        "deployment-kafka-main-5.deployment-prep.eqiad1.wikimedia.cloud:9092",
        "deployment-kafka-main-6.deployment-prep.eqiad1.wikimedia.cloud:9092"
      ],
      "dc_name": "eqiad",
      "consumer": {
        "fetch.message.max.bytes": 0,
        "log.connection.close": false
      },
      "producer": {
        "compression.codec": null,
        "log.connection.close": false,
        "linger.ms": 20
      },
      "concurrency": 50,
      "startup_delay": 60000,
      "disable_blacklist": true,
      "disable_ratelimit": true
    },
    "tassembly": "{metadata_broker_list:[deployment-kafka-main-5.deployment-prep.eqiad1.wikimedia.cloud:9092,deployment-kafka-main-6.deployment-prep.eqiad1.wikimedia.cloud:9092],dc_name:eqiad,consumer:{fetch.message.max.bytes:0,log.connection.close:false},producer:{,log.connection.close:false,linger.ms:20},concurrency:50,startup_delay:60000,disable_blacklist:true,disable_ratelimit:true}"
  },
  "stack": "SyntaxError: Unexpected token ,\n    at new Function (<anonymous>)\n    at TAssembly.compile (/srv/service/node_modules/tassembly/tassembly.js:526:11)\n    at new Template (/srv/service/node_modules/swagger-router/lib/reqTemplate.js:413:34)\n    at Router._expandOptions (/srv/service/node_modules/hyperswitch/lib/router.js:119:23)\n    at Router._loadModule (/srv/service/node_modules/hyperswitch/lib/router.js:198:30)\n    at /srv/service/node_modules/hyperswitch/lib/router.js:298:25\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Object.gotValue (/srv/service/node_modules/bluebird/js/release/reduce.js:166:18)\n    at Object.gotAccum (/srv/service/node_modules/bluebird/js/release/reduce.js:155:25)\n    at Object.tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromiseCtx (/srv/service/node_modules/bluebird/js/release/promise.js:641:10)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:97:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)",
  "levelPath": "fatal/startup",
  "msg": "Message not supplied",
  "time": "2023-11-20T18:02:59.796Z",
  "v": 0
}

Looks like a bug in https://github.com/gwicke/tassembly/blob/master/tassembly.js#L526 with modern NodeJS?

Dunno why we don't see this in prod, maybe only some beta specific config triggers this?

Giving up for this task. reverting cpjobqueue to old image in beta.

Anyway, in summary, I believe we can deploy this in December after I'm back and before we enable canary events for all streams.

Change 975862 merged by Ottomata:

[operations/deployment-charts@master] changeprop - fixes for beta values

https://gerrit.wikimedia.org/r/975862

Anyway, in summary, I believe we can deploy this in December after I'm back and before we enable canary events for all streams.

Looks sane, but I'd involve somebody from ServiceOps to review the use case before proceeding. There may be some subtle issues popping up after the deploy and the more people are aware the better :)

Change 982098 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] changeprop - bump image version to discard canary events

https://gerrit.wikimedia.org/r/982098

Change 982098 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop - bump image version to discard canary events

https://gerrit.wikimedia.org/r/982098

@Ottomata let's send an email to ops@ to alert about this change, it is not big but it is the first one in a while (in changeprop's logic I mean). If anything comes up people will know straight away rather than checking in the logs etc.. (my 2c)

Okay, I'll send one along with the enabling of canary events in general. TY