
Implement the ingestion job
Closed, Resolved · Public · 8 Estimated Story Points

Description

The ingestion job (cirrus-streaming-updater-consumer) should read messages from a Kafka topic and write them to an Elasticsearch index.

Messages from the Kafka topic should comply with the schema defined at https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/856507.
Writing to Elasticsearch could be assisted by the Elasticsearch connector.

The job's main function will be to create the bulk requests:

  • create a scripted update request, similar to what CirrusSearch does for revision-based updates
  • create a delete request for page deletes
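
As a rough illustration of the two request types, the sketch below assembles the newline-delimited JSON body that the Elasticsearch bulk API accepts: an `update` action carrying a `script`, and a `delete` action. It is not the job's actual implementation. The class `BulkBodySketch`, its method names, and the script name, language, and params used in the usage note are hypothetical placeholders. The real job would more likely build these requests through an Elasticsearch client or connector than by hand.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds the NDJSON body of an Elasticsearch _bulk request. */
final class BulkBodySketch {
    private final List<String> lines = new ArrayList<>();

    /** Wraps a value in double quotes, escaping quotes, backslashes and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Action/metadata line shared by both request types. */
    private static String meta(String action, String index, String id) {
        return "{" + quote(action) + ":{\"_index\":" + quote(index)
                + ",\"_id\":" + quote(id) + "}}";
    }

    /** Scripted update: one metadata line followed by one script line. */
    BulkBodySketch scriptedUpdate(String index, String id, String scriptSource,
                                  String scriptLang, String paramsJson) {
        lines.add(meta("update", index, id));
        // paramsJson is assumed to be an already serialized JSON object.
        lines.add("{\"script\":{\"source\":" + quote(scriptSource)
                + ",\"lang\":" + quote(scriptLang)
                + ",\"params\":" + paramsJson + "}}");
        return this;
    }

    /** Delete: a single metadata line, no payload line. */
    BulkBodySketch delete(String index, String id) {
        lines.add(meta("delete", index, id));
        return this;
    }

    /** The bulk API expects every line, including the last, to end with a newline. */
    String build() {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

A page update followed by a page delete could then be combined into one body with `new BulkBodySketch().scriptedUpdate("testwiki_content", "42", "example_script", "example_lang", "{\"source\":{\"title\":\"Foo\"}}").delete("testwiki_content", "7").build()`, where the index name, IDs, and script values are made up for the example.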

AC:

  • a new Flink job can be scheduled that consumes a topic of update documents and writes to an Elasticsearch cluster
  • updates can be filtered per wiki based on a command-line parameter (to ease testing)
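
The per-wiki filter in the second criterion could look roughly like the sketch below. It is an assumption-laden illustration rather than the job's code: the `--wikis` parameter name, the comma-separated format, and the class `WikiFilterSketch` are hypothetical, and the wiki identifier is assumed to have already been extracted from each update document as a string. An absent parameter is treated as "accept every wiki".

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of a per-wiki filter driven by a command-line parameter. */
final class WikiFilterSketch {

    /** Reads "--wikis a,b,c" from the arguments; returns an empty set if it is absent. */
    static Set<String> parseWikis(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("--wikis".equals(args[i])) {
                Set<String> wikis = new HashSet<>();
                for (String w : args[i + 1].split(",")) {
                    String trimmed = w.trim();
                    if (!trimmed.isEmpty()) {
                        wikis.add(trimmed);
                    }
                }
                return wikis;
            }
        }
        return Collections.emptySet();
    }

    /** An empty set means no filtering; otherwise only the listed wikis pass. */
    static Predicate<String> wikiFilter(Set<String> wikis) {
        if (wikis.isEmpty()) {
            return wikiId -> true;
        }
        return wikis::contains;
    }
}
```

In a streaming job, such a predicate would typically be applied in a filter step placed before the bulk requests are built, so that updates for other wikis never reach the sink.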

Event Timeline

Gehel triaged this task as High priority. Nov 21 2022, 4:25 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 8. Nov 21 2022, 4:53 PM

Change 860516 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[search/cirrus-streaming-updater@master] Consume internal CirrusSearch updates

https://gerrit.wikimedia.org/r/860516

Change 864733 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[search/cirrus-streaming-updater@master] Make sure code runs on Java 1.8

https://gerrit.wikimedia.org/r/864733

Change 864788 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[search/cirrus-streaming-updater@master] Make kafka sink configurable

https://gerrit.wikimedia.org/r/864788

Change 864733 merged by jenkins-bot:

[search/cirrus-streaming-updater@master] Make sure code runs on Java 1.8

https://gerrit.wikimedia.org/r/864733

Change 864788 merged by jenkins-bot:

[search/cirrus-streaming-updater@master] Make kafka sink configurable

https://gerrit.wikimedia.org/r/864788

Change 871227 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[search/cirrus-streaming-updater@master] Downgrade Caffeine to 2.9.3 for compatibility with Java 8

https://gerrit.wikimedia.org/r/871227

Change 871227 merged by jenkins-bot:

[search/cirrus-streaming-updater@master] Downgrade Caffeine to 2.9.3 for compatibility with Java 8

https://gerrit.wikimedia.org/r/871227

Change 860516 merged by jenkins-bot:

[search/cirrus-streaming-updater@master] Consume internal updates and map them to elasticsearch requests

https://gerrit.wikimedia.org/r/860516