
Model the update document used by the CirrusSearch Update Pipeline
Closed, Resolved | Public | 8 Estimated Story Points

Description

The update pipeline will have to construct an update document that will be used to carry the data to index.
The fields to support are specified here (minus extra_source).

CirrusSearch models this as a PHP array and uses other configuration resources to inform some update hints (super noop options).
Ideally we would like to avoid having to replicate (or fetch) any configuration of the target wiki; the model should carry all the information needed to manipulate itself. In other words, the super_detect_noop options will have to be modeled as well.
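
A minimal sketch of what "carrying the noop options with the document" could look like, in Python for brevity. The class name `UpdateDocument`, the method `set_field`, the field names, and the hint strings are all hypothetical and only illustrate the idea that each field travels with its own hint, so no wiki configuration has to be fetched later.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UpdateDocument:
    """Hypothetical update document: field values plus per-field noop hints."""

    fields: Dict[str, Any] = field(default_factory=dict)
    # One hint per field; the hint strings used below are illustrative only.
    noop_hints: Dict[str, str] = field(default_factory=dict)

    def set_field(self, name: str, value: Any, hint: str = "equals") -> None:
        # Store the value and its hint together so they cannot drift apart.
        self.fields[name] = value
        self.noop_hints[name] = hint


doc = UpdateDocument()
doc.set_field("title", "Main Page")
doc.set_field("incoming_links", 1200, hint="within 20%")
```

Keeping the hints next to the values means any downstream consumer can build the update from the document alone.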

The pipeline aims to support three kinds of updates:

  • revision based updates
  • content refresh updates (re-renders)
  • update fragments (for side data such as weighted_tags and pageview-related signals)

The model must support merging all these updates together given a set of rules:

  • a revision update can be merged with an update fragment (e.g. a page edit and the corresponding update fragment obtained when ORES does its topic detection)
  • two or more update fragments can be combined together
  • conflicts must be resolved when e.g. two update fragments attempt to update the same field
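
The merge rules above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `UpdateKind`, `Update`, `merge`, and `event_time` are hypothetical names, the "latest event time wins" conflict policy is an assumption (the task does not specify one), and combinations the task does not list are rejected rather than guessed at.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class UpdateKind(Enum):
    REVISION = "revision"
    REFRESH = "refresh"
    FRAGMENT = "fragment"


@dataclass
class Update:
    kind: UpdateKind
    page_id: int
    event_time: int  # epoch seconds; used only for the assumed tie-break
    fields: Dict[str, Any] = field(default_factory=dict)


def merge(a: Update, b: Update) -> Update:
    """Merge two updates for the same page following the rules listed above."""
    if a.page_id != b.page_id:
        raise ValueError("updates target different pages")
    kinds = {a.kind, b.kind}
    if kinds == {UpdateKind.FRAGMENT}:
        merged_kind = UpdateKind.FRAGMENT
    elif kinds == {UpdateKind.REVISION, UpdateKind.FRAGMENT}:
        merged_kind = UpdateKind.REVISION
    else:
        # Not covered by the rules in the task, so this sketch refuses.
        raise ValueError(f"merge not modeled for {a.kind} and {b.kind}")
    # Assumed conflict policy: on a shared field, the newer update wins.
    older, newer = sorted((a, b), key=lambda u: u.event_time)
    return Update(merged_kind, a.page_id, newer.event_time,
                  {**older.fields, **newer.fields})


edit = Update(UpdateKind.REVISION, 42, 100, {"text": "hello"})
tags = Update(UpdateKind.FRAGMENT, 42, 105, {"weighted_tags": ["example/tag"]})
merged = merge(edit, tags)
```

A real model would likely need a per-field policy (for example, set-union for tag-like fields) rather than a single global tie-break.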

The model must support being serialized by Flink (i.e. a versioned Flink serializer might be wise to implement).
The model must support being serialized to JSON using a dedicated schema that will have to be designed.
The model must support being serialized as an Elasticsearch update (super_detect_noop updates and delete operations).
The model must support being constructed out of the response of the CirrusSearch API to render its document (T317309).
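
As a rough illustration of the Elasticsearch serialization requirement, the sketch below turns a set of fields and noop hints into bulk API lines. The `update` and `delete` action lines follow the standard Elasticsearch bulk format. The layout of the `super_detect_noop` script parameters (`source` and `handlers`) is an assumption and should be checked against the plugin's documentation; the function name `to_bulk_lines`, the index name, and the field values are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def to_bulk_lines(index: str, doc_id: str,
                  fields: Optional[Dict[str, Any]],
                  noop_hints: Dict[str, str]) -> List[str]:
    """Return the bulk lines for one document; fields=None means delete."""
    if fields is None:
        # A delete is a single action line with no body.
        return [json.dumps({"delete": {"_index": index, "_id": doc_id}})]
    action = {"update": {"_index": index, "_id": doc_id}}
    body = {
        "script": {
            # Assumed parameter layout for the super_detect_noop script.
            "source": "super_detect_noop",
            "lang": "super_detect_noop",
            "params": {"source": fields, "handlers": noop_hints},
        },
        # Index the fields as-is if the document does not exist yet.
        "upsert": fields,
    }
    return [json.dumps(action), json.dumps(body)]


update_lines = to_bulk_lines("examplewiki_content", "42",
                             {"title": "Main Page"}, {"title": "equals"})
delete_lines = to_bulk_lines("examplewiki_content", "42", None, {})
```

Because the hints are passed in alongside the fields, this step needs no access to the target wiki's configuration.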

Note: multiple iterations are expected as we implement the search update pipeline itself and discover corner cases we did not know about. This ticket is about implementing a reasonable first iteration.

Caveats:

  • CirrusSearch does not yet produce a document that perfectly matches the schema defined here.

AC:

  • a first version of the model is defined and implemented

Event Timeline

dcausse updated the task description.
dcausse updated the task description.
MPhamWMF set the point value for this task to 8. (Oct 3 2022, 3:51 PM)
Gehel removed the point value for this task.
MPhamWMF set the point value for this task to 8. (Oct 3 2022, 3:53 PM)

Change 856507 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[schemas/event/primary@master] Provide internal schema for CirrusSearch update-pipeline updates.

https://gerrit.wikimedia.org/r/856507

Leaving this open until work on the search update pipeline is done, in case any changes become necessary along the way.

Change 856507 merged by jenkins-bot:

[schemas/event/primary@master] Provide internal schema for CirrusSearch update-pipeline updates.

https://gerrit.wikimedia.org/r/856507