
Enable ChangeProp to consume mediawiki.page_content_change.v1
Closed, DeclinedPublic

Description

Decision: The proposal for page_content_change on Kafka Main (option A) was not approved, the ML team proceeded with mediawiki.page_change.v1 instead (option D).

As @Ottomata noted in T401021#11345086, mediawiki.page_content_change.v1 currently exists only in Kafka jumbo-eqiad, while ChangeProp only consumes from Kafka main. As a result, ChangeProp cannot consume from mediawiki.page_content_change.v1 to trigger LiftWing updates for Revise Tone Task Generation.

We have a few options: enable ChangeProp to consume mediawiki.page_content_change.v1, or consume mediawiki.page_change.v1 instead and query page content from the MW API. We need to decide which option is best to move forward.


Options from @Ottomata in T401021#11345086:

Option A. Produce mediawiki.page_content_change.v1 to Kafka main

This is my preferred option. I think having access to mediawiki.page_content_change.v1 and other streams like this will be useful for realtime updates for derived data products like this one.

The original reason this was not produced to Kafka main was that SRE was worried about polluting Kafka main with a stream that has large event bodies. Previously, the only user of this stream was mediawiki_content_change_v1 in the Data Lake, so there was no reason to produce to Kafka main.

We should consider this and talk to SRE ServiceOps to see what they think.

Option B. New change-prop service consuming from Kafka jumbo

Ideally this wouldn't be too hard to do (although I'm not sure its helm chart is in good shape to make this easy). We'd have to figure out where to run it (dse-k8s-eqiad?).

This is my least preferred option. I don't want to deploy more change-props.

Option C. New change-prop rule consuming from Kafka jumbo

This would probably require:

  • A new change-prop route rule (/{api:sys}/queue-jumbo?) declared in the helm chart, like this.
  • Helm chart and helmfile modifications to support consuming from multiple kafka clusters.

If this isn't too hard, this option would be an okay compromise, assuming SRE ServiceOps won't like Option A.

I'm not sure, but I don't think this will require any actual change-prop code changes. Just helm config changes to declare the new routes and kafka configs.
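To make Option C concrete, here is a rough sketch of the kind of helm values fragment it implies. All of the keys, rule names, broker hostnames, and the LiftWing URI below are illustrative placeholders, not the chart's actual schema — as noted, the current chart likely assumes a single Kafka cluster, so a shape like this would first need chart support:

```yaml
# Hypothetical change-prop values fragment (illustrative only).
kafka:
  clusters:
    main:
      broker_list: kafka-main1001.eqiad.wmnet:9092   # placeholder broker
    jumbo:
      broker_list: kafka-jumbo1001.eqiad.wmnet:9092  # placeholder broker

# A new route consuming from the jumbo cluster instead of main.
routes:
  /sys/queue-jumbo:
    kafka_cluster: jumbo
    rules:
      page_content_change_liftwing:
        topic: eqiad.mediawiki.page_content_change.v1
        exec:
          method: post
          uri: https://inference.svc.eqiad.wmnet/...  # LiftWing endpoint (placeholder)
```

The key point the sketch illustrates is that the rule itself would be ordinary change-prop config; the new work is teaching the chart/helmfile to carry more than one `broker_list`.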

Option D. Consume mediawiki.page_change.v1 instead

This is probably the fastest path to production. LiftWing already responds to mediawiki.page_change.v1 events via change-prop and Kafka main. Doing this for tone check score would mean that the page content would have to be looked up from the MediaWiki API at score time, rather than just getting it out of the page_content_change event body.

This is already done for other models in LiftWing, so perhaps this is easy to do quickly?

I'd prefer to avoid the extra MW API lookups for page content for all of the other LiftWing usages too. All of the other options would allow us to do that.
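To illustrate what the extra lookup in Option D amounts to: a page_change event carries no content, so the wikitext must be fetched at score time via the standard MediaWiki Action API `prop=revisions` query. This is a minimal sketch (error handling and retries are elided; the scoring side is out of scope):

```python
# Sketch of the MW API content lookup Option D implies: given the
# rev_id from a page_change event, build a prop=revisions query that
# returns the main-slot wikitext, then extract it from the response.
import json
import urllib.parse
import urllib.request


def content_query_url(api_base: str, rev_id: int) -> str:
    """Build an Action API URL returning the content of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }
    return api_base + "?" + urllib.parse.urlencode(params)


def extract_content(response_json: dict) -> str:
    """Pull the main-slot wikitext out of a formatversion=2 response."""
    page = response_json["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


# At score time the service would do roughly:
#   with urllib.request.urlopen(content_query_url(api, rev_id)) as r:
#       wikitext = extract_content(json.load(r))
url = content_query_url("https://en.wikipedia.org/w/api.php", 123456)
```

With page_content_change, by contrast, the content is already in the event body and none of this is needed.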

Event Timeline

I wanted to understand how multi-DC-ness relates to all the pieces here. Just writing down what I found:

  • Kafka jumbo-eqiad is only in eqiad
  • LiftWing is only in eqiad (right?)
  • Cassandra is multi DC
  • change-prop is multi DC

Because MediaWiki is active/passive, mediawiki.page_change.v1 will (mostly) only be produced to one of its DC-prefixed topics at a time. E.g. if eqiad is active, only the eqiad.mediawiki.page_change.v1 topic will have events produced to it.

change-prop is deployed multi DC in wikikube k8s in eqiad and codfw. In each datacenter, it only consumes from its local DC prefixed topics. E.g. in eqiad change-prop only consumes from eqiad.mediawiki.page_change.v1. It then calls out to LiftWing, which I believe only exists in eqiad(?) Example rule config here: https://gitlab.wikimedia.org/-/snippets/105#L892

LiftWing is only in eqiad, so it will only write to Cassandra in eqiad. (I don't know much about Cassandra Multi-DCness).


So, if we do Option B. or Option C. (allow change-prop to consume from Kafka jumbo), when codfw is the active datacenter, the data flow will go like this:

MediaWiki in codfw 
-> Kafka main-codfw topic codfw.mediawiki.page_change.v1
-> mediawiki_page_content_change_enrich Flink Job in codfw produce
-> |cross DC| Kafka jumbo-eqiad `codfw.mediawiki.page_content_change.v1` topic
-> |cross DC| codfw change-prop consumes `codfw.mediawiki.page_content_change.v1` from Kafka jumbo-eqiad
-> |cross DC| call LiftWing in eqiad
-> LiftWing write to Cassandra in eqiad

That is 3 cross DC hops for this pipeline.

(There may also be a cross DC write when producing weighted tags events from LiftWing too, but I think there isn't. cc @dcausse)


If we do Option A. Produce mediawiki.page_content_change.v1 to Kafka main, this looks like:

Change-Prop and liftwing do:

MediaWiki in codfw 
-> Kafka main-codfw topic codfw.mediawiki.page_change.v1
-> mediawiki_page_content_change_enrich Flink Job in codfw produce
-> Kafka main-codfw `codfw.mediawiki.page_content_change.v1` topic
-> codfw change-prop consumes `codfw.mediawiki.page_content_change.v1` from Kafka main-codfw
-> |cross DC| call LiftWing in eqiad
-> LiftWing write to Cassandra in eqiad

Only one crossing of the DC boundary (because LiftWing is only in eqiad).

LiftWing is only in eqiad (right?)

LiftWing is in both eqiad and codfw

Option C. New change-prop rule consuming from Kafka jumbo

I was looking into this. Would it just be adding the kafka-jumbo brokers (kafka-jumbo101x.eqiad.wmnet) into the broker_list in changeprop's values-eqiad.yaml, without requiring a new rule in the helm chart? @elukey, since we worked on change-prop's chart before, wdyt?

Option A would require some talk with SRE, but given the size of the topic and the current /srv usage in main-eqiad / codfw I don't see any big opposition to hosting the stream there (especially if we advertise that, as a direct consequence, ML will not need to query the MediaWiki API for this use case). It would probably be the cleanest and most reliable option in my opinion.

Option C should work; in theory both Changeprops (eqiad and codfw) will listen/pull from eqiad/codfw-prefixed topics without duplicating work. The only downside would be the cross-DC request from, say, Changeprop codfw to Kafka jumbo (as Andrew anticipated). After checking the helmfile settings I fear that the chart assumes only one Kafka cluster, either main or jumbo, but we'd need to dig a little bit more to confirm or deny this assumption :)

Option D is also fine in my opinion; we'll pay an extra call to the MediaWiki API, but this stream is low volume and we don't really have any strict latency/performance deadlines to meet when handling it.

Option A would require some talk with SRE, but given the size of the topic and the current /srv usage in main-eqiad / codfw I don't see any big opposition to hosting the stream there (especially if we advertise that, as a direct consequence, ML will not need to query the MediaWiki API for this use case). It would probably be the cleanest and most reliable option in my opinion.

I agree Option A would be the preferred option. To move forward, what's the process for getting approval from SRE ServiceOps? Who should we talk to? cc @klausman

@Ottomata, what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?

what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?

Yeah, it should just be a helmfile change.
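For reference, the kind of helmfile change being discussed would look roughly like the fragment below. The key names are illustrative — the actual schema of the mw-page-content-change-enrich chart may differ — and the broker address is a placeholder:

```yaml
# mw-page-content-change-enrich/values-codfw.yaml (illustrative fragment)
# Before: the enrichment job's sink produced to Kafka jumbo-eqiad.
# After: the sink produces to Kafka main-codfw; Mirror Maker then copies
# the topic to jumbo for the existing Data Lake consumers.
app:
  config_files:
    app.config.yaml:
      kafka:
        sink:
          bootstrap_servers: kafka-main2001.codfw.wmnet:9092  # placeholder
          topic: codfw.mediawiki.page_content_change.v1
```

Only the sink side moves; the job keeps reading page_change from Kafka main as it does today.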

Hi @Joe! The Machine Learning and Growth teams are collaborating on a GrowthExperiments newcomer task for revising tone (associated hypotheses are WE1.1.2 & WE1.1.17).

@Ottomata mentioned that they considered producing mediawiki.page_content_change.v1 to Kafka main but deferred it since use cases were mostly data-lake oriented. We now have a real-time use case and want to revisit that.

Producing mediawiki.page_content_change.v1 to Kafka main would enable real-time updates for derived data products like the tone-suggestion workflow. We prefer this approach since it avoids relying on the MediaWiki API. While the original concern was large event bodies, current topic size and /srv usage suggest it should be fine as @elukey noted in the comment. It's likely the cleanest and most reliable option.

I'd appreciate hearing your thoughts on this. I'm reaching out to you since I found your discussion with the Data Engineering team from two years ago in T330507#8735183.

@Ottomata, what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?

As the PyFlink application already lives in the main K8s cluster and already reads from the main Kafka cluster, I believe that changing the sink to write to the main Kafka cluster should be enough. It means it will stop writing to Jumbo, start writing to a new topic in Main, and Mirror Maker will copy all new messages into the Jumbo cluster, into the same topic the PyFlink application is writing to now. Offsets should be fine as the application and consumer are the same.

I don't think we can have both versions of the PyFlink job running at the same time; Mirror Maker would copy messages that were already processed, duplicating everything.

Changing the sink in the application shouldn't affect other applications reading the topics in Jumbo either.

If we get approval to move it to Kafka main, I can create the MR with the changes.

@achou, do you have a timeline for this initiative? So we can prioritize it.

If pushing to kafka-main you might need to increase broker's message.max.bytes see T344688.

Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?

Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?

indeed, and even if auto topic creation is enabled we probably want to set the number of partitions to 5 like other "big" topics?

That sounds good. Then we could consider increasing the partitions in Jumbo too; codfw.mediawiki.page_content_change.v1 and eqiad.mediawiki.page_content_change.v1 both have 3 partitions right now. I'm not sure if any consumer in Jumbo relies on message ordering that could be affected by the change in partitions; I'm guessing not.
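If the topics do need to be created explicitly on main with 5 partitions, the upstream Kafka CLI equivalents would look roughly like this. WMF normally manages topics and broker settings via configuration management rather than ad-hoc commands, so the broker address, replication factor, and byte limit below are all illustrative:

```shell
# Create a DC-prefixed topic on kafka-main with 5 partitions
# (broker address and replication factor are placeholders).
kafka-topics.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --create --topic eqiad.mediawiki.page_content_change.v1 \
  --partitions 5 --replication-factor 3

# Raise the broker-side message size cap if the default (~1 MB) is too
# small for large page bodies (value here is an example; see T344688).
kafka-configs.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --entity-type brokers --entity-default --alter \
  --add-config message.max.bytes=4194304
```

Note that increasing partitions on the existing Jumbo topics would change key-to-partition assignment, which is why the ordering question above matters.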

Thank you for the discussion everyone! Reading through, I would suggest proceeding with Option D for the time being. This approach not only unblocks the work without requiring any significant changes to Kafka, but also allows us to observe the workflow in practice and better understand its requirements.
That said, we can later define a set of performance expectations (eg for latency), which will then help us to assess whether any of the other options would provide sufficient benefit to justify any additional efforts. Thoughts?

Hi, thanks all for the input. :) Due to our tight timeline, the ML team has decided to move forward with Option D for now.

That said, we can later define a set of performance expectations (eg for latency), which will then help us to assess whether any of the other options would provide sufficient benefit to justify any additional efforts. Thoughts?

I agree! We should follow up on this and revisit the topic in the future. The ML team would really like to see this work happen, as we will have other similar use cases that could benefit from mediawiki.page_content_change.v1 in Kafka main.

I'm very happy you're going with the option @jijiki recommended, which sounds like both the path of least resistance and the best option.

Having said that, let me state this for posterity: I don't think that, as it stands, moving such a big topic to kafka-main is an option. If we want such fat topics (on top of all the other very busy topics on -main) in that cluster, we need a hardware expansion. It's also not an option because there seems to be a lot of duplication with the topic already existing on kafka-main.

So if we ever want to move forward with moving that topic, we must reason about why we'd need both page_content_change and page_change as topics, and consolidate the needs in a single flow.

achou updated the task description.