Page MenuHomePhabricator

Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp
Closed, ResolvedPublic

Description

In T317768 we are trying to split the mediawiki.revision-score stream into multiple ones, and in order to do so we'd need a stream processor. Before attempting Flink or Benthos, we should see if anything can be done effectively with ChangeProp. It would save us a lot of time and we'd also not need to manage a stream processor on our own for these use cases.

The current workflow is explained in T317768's description, so we'll avoid the repetition. We should concentrate on two bits of software though:

The idea is to use ChangeProp to just call Lift Wing when a certain revision-create events is read from Kafka, and let Lift Wing to send the correspondent revision-score event to EventGate. Ideally ores_updates.js will not be needed anymore, to be replaced with something way simpler, hopefully yaml config. The only caveat is 00-main.yaml: ORES knows what scores/models are "compatible"/"needed" for any given rev-id/wiki combination, so we should try to replicate it somewhere else.

The simplest idea could be to have the following:

  • Set up a new stream called mediawiki.revision-score-goodfaith in Mediawiki's config.
  • Add a new workflow/config to ChangeProp to support it, namely:
    • read events from revision-create or page-change.
    • check a list of "supported" wikis in Lift Wing
    • hit or not the Lift Wing's inference endpoint, triggering behind the scenes an event to EventGate.

The alternative is to provide an endpoint on Lift Wing with the list of supported wikis for each model-server, and then use it on the ChangeProp side, but it may require more work (we can start simple and then refine).

Event Timeline

Change 881594 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] WIP changeprop: add revscoring streams

https://gerrit.wikimedia.org/r/881594

Change 881664 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add a new test workflow for Lifting to changeprop's staging

https://gerrit.wikimedia.org/r/881664

Change 881594 merged by Elukey:

[operations/deployment-charts@master] changeprop: add liftwing revscoring streams

https://gerrit.wikimedia.org/r/881594

Change 881664 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add a new test workflow for Lifting to changeprop's staging

https://gerrit.wikimedia.org/r/881664

Change 882654 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: fix uri in liftwing's template

https://gerrit.wikimedia.org/r/882654

Change 882654 merged by Elukey:

[operations/deployment-charts@master] changeprop: fix uri in liftwing's template

https://gerrit.wikimedia.org/r/882654

Change 882668 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: fix liftwing's body settings

https://gerrit.wikimedia.org/r/882668

Change 882668 merged by Elukey:

[operations/deployment-charts@master] changeprop: fix liftwing's body settings

https://gerrit.wikimedia.org/r/882668

After getting some help from Hugh (thanks!) I was able to have some rules processed by Changeprop in staging:

https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s-staging&from=now-1h&to=now

Caveat: to test any event, a manual action is needed - https://wikitech.wikimedia.org/wiki/Changeprop#Testing. In my case I sent an event to the staging.mediawiki.revision-create topic (kafka-main eqiad). I didn't see any sign of Lift Wing being called in the related pod's access log, so something is still not working.

Change 883117 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: change liftwing's test rules

https://gerrit.wikimedia.org/r/883117

Change 883117 merged by Elukey:

[operations/deployment-charts@master] services: change liftwing's test rules

https://gerrit.wikimedia.org/r/883117

Still not able to see an HTTP request on the Lift Wing side (triggered by ChangeProp). I did the following:

  • relaxed the regex on Changeprop's config to match only enwiki (I suspected that the match rule was not acting)
  • checked if any log was emitted by Changeprop (none)
  • checked if the changeprop pods can contact liftwing in staging (they can, verified with nsenter)

Change 883128 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: update liftwing's rule templating

https://gerrit.wikimedia.org/r/883128

Change 883135 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: increase verbosity of changeprop logs in staging

https://gerrit.wikimedia.org/r/883135

Change 883128 merged by Elukey:

[operations/deployment-charts@master] changeprop: update liftwing's rule templating

https://gerrit.wikimedia.org/r/883128

Change 883135 merged by Elukey:

[operations/deployment-charts@master] services: increase verbosity of changeprop logs in staging

https://gerrit.wikimedia.org/r/883135

After a long debugging with Hugh, we figured out the problem and reported in T327805. Hugh sent a patch to change the CA BUNDLE used by changeprop (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883240/), so in theory after deploying it to staging we should start seeing some requests to LiftWing :)

New error:

"Hostname/IP does not match certificate's altnames: Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org. is not in the cert's altnames: DNS:inference-staging.discovery.wmnet, DNS:inference-staging.svc.codfw.wmnet, DNS:inference-staging.svc.eqiad.wmnet"

The good news is that ChangeProp seems correctly calling Lift Wing now! I am not 100% sure about this TLS error since in theory it shouldn't happen, but I am going to investigate :)

Edit: the issue seems to be https://github.com/nodejs/node/issues/37104, so node uses the HTTP Host header when checking the servername during the TLS handshake. This seems a problem not really resolved, so we'll have to workaround it.

Change 883964 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add SANs to the inference endpoints for ml-serve

https://gerrit.wikimedia.org/r/883964

Change 883964 merged by Elukey:

[operations/deployment-charts@master] admin_ng: add SANs to the inference endpoints for mlserve staging

https://gerrit.wikimedia.org/r/883964

Finally I managed to make the ChangeProp -> LiftWing connection work! I'll do more tests and then we should be able to wrap up and productionize everything :)

Change 884311 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update liftwing's test database pattern for changeprop staging

https://gerrit.wikimedia.org/r/884311

Change 884311 merged by Elukey:

[operations/deployment-charts@master] services: update liftwing's test database pattern for changeprop staging

https://gerrit.wikimedia.org/r/884311

Next steps:

  1. Extend https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883964 to the inference prod endpoints.
  2. Deploy with Hugh the new version of Changeprop in production (not with our configs, but containing some fixes to allow our configs to work).

Change 885625 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: promote the inference-staging's TLS SANs to inference

https://gerrit.wikimedia.org/r/885625

Change 885625 merged by Elukey:

[operations/deployment-charts@master] admin_ng: promote the inference-staging's TLS SANs to inference

https://gerrit.wikimedia.org/r/885625

Change 885740 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add missing TLS SAN for the inference endpoint

https://gerrit.wikimedia.org/r/885740

Change 885740 merged by Elukey:

[operations/deployment-charts@master] admin_ng: add missing TLS SAN for the inference endpoint

https://gerrit.wikimedia.org/r/885740

Change 885991 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: refactor match template for liftwing

https://gerrit.wikimedia.org/r/885991

While thinking about the documentation I realized that restricting the regex to database is not good for the input event source, since it may not be there (mediawiki.page.change uses meta.domain). To be flexible I added the possibility to match on a meta field, see https://gerrit.wikimedia.org/r/885991.

We'd need to deploy the change to staging, test it and deploy to prod before closing :)

Change 885991 merged by Elukey:

[operations/deployment-charts@master] changeprop: refactor match template for liftwing

https://gerrit.wikimedia.org/r/885991

Change-prop updated in production, we are now ready to have streams!