
Consume ORES articletopic data from Kafka and store it in HDFS
Closed, Invalid · Public

Description

ORES articletopic scores pushed into Kafka via T240549: Configure ORES to publish new drafttopic scores to Kafka need to be ingested into HDFS somehow, and new scores for the same page need to override old ones. Presumably EventGate can handle this.

Open questions:

  • Index via title or page ID? Indexing by title is less efficient and less robust, but would allow cross-wiki lookups, which may or may not be needed while articletopic is enwiki-only.
  • Do we need a mechanism to get rid of data for deleted pages?
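The "new scores override old scores" requirement above amounts to a keyed upsert per page. A minimal sketch, keying on page ID as discussed in the first open question; the field names (`page_id`, `rev_id`, `topics`) are illustrative assumptions, not the actual event schema:

```python
# Sketch: keep only the newest articletopic score per page.
# Field names (page_id, rev_id, topics) are illustrative assumptions,
# not the actual mediawiki/revision/score event fields.

def latest_scores(events):
    """Reduce a stream of score events to one entry per page,
    letting the highest rev_id (i.e. the newest revision) win."""
    by_page = {}
    for event in events:
        page_id = event["page_id"]
        current = by_page.get(page_id)
        if current is None or event["rev_id"] > current["rev_id"]:
            by_page[page_id] = event
    return by_page

events = [
    {"page_id": 1, "rev_id": 100, "topics": ["History"]},
    {"page_id": 1, "rev_id": 105, "topics": ["History", "Europe"]},
    {"page_id": 2, "rev_id": 101, "topics": ["Biology"]},
]
result = latest_scores(events)
```

In practice this deduplication would run as a query or compaction step over the raw event table in HDFS rather than in application code, but the semantics are the same.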

Event Timeline

The relevant conversation happened in T240549: Configure ORES to publish new drafttopic scores to Kafka. I read it as saying that this task is actually a no-op: the existing mechanism (EventGate with the mediawiki/revision/score schema, storing data in the event.mediawiki_revision_score Hive table) is already score-type agnostic and will handle whatever score types ORES outputs when its precache endpoint is called.
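To illustrate why the mechanism is score-type agnostic: the event carries a map from model name to score payload, so a consumer can pull out articletopic (or any other model's) scores without schema changes. The event shape below is a simplified assumption for illustration, not the exact mediawiki/revision/score schema:

```python
# Simplified, assumed shape of a revision score event: the "scores"
# field maps model name -> score payload, so the ingestion pipeline
# does not need to know score types in advance.

sample_event = {
    "database": "enwiki",
    "rev_id": 123456,
    "scores": {
        "articletopic": {
            "prediction": ["Culture.Biography"],
            "probability": {"Culture.Biography": 0.92},
        },
        "damaging": {
            "prediction": [False],
            "probability": {"false": 0.97, "true": 0.03},
        },
    },
}

def extract_model_score(event, model):
    """Return the score payload for one model, or None if absent."""
    return event.get("scores", {}).get(model)

articletopic = extract_model_score(sample_event, "articletopic")
```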

Halfak added a subscriber: Halfak.

We just deployed a change that makes a mediawiki/revision/score event get produced every time an article is edited. So that should show up in the stream and eventually in HDFS.

Do we need a mechanism to get rid of data for deleted pages?

@Tgr and @Halfak, did this question get resolved? If not, I can open a separate ticket for it.

Additionally, @Halfak are you and your team able to do a QA check on the data flow into HDFS, or would you like us to see if we can do it from our side before closing out the ticket?

kostajh renamed this task from Consume ORES drafttopic data from Kafka and store it in HDFS to Consume ORES articletopic data from Kafka and store it in HDFS.Feb 6 2020, 11:18 AM
kostajh updated the task description.

As per above, this turned out to be a no-op.