The update pipeline will have to construct an //update document// that carries the data to index.
How to model this is not entirely trivial given that the list of fields is not fully known in advance. The fields to support are specified [[https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/844075/4/airflow/tests/fixtures/hive_operator_hql/import_cirrus_indexes_init_create_tables.expected | here]] (minus `extra_source`).
CirrusSearch models this using a PHP array and uses other configuration resources to inform some update hints (the super_detect_noop options).
Ideally we would like to avoid having to replicate (or fetch) any configuration of the target wiki, and the model should carry all the information needed to manipulate itself; in other words, the //super_detect_noop// options will have to be modeled as well.
The pipeline aims to support three kinds of updates:
- revision based updates
- content refresh updates (re-renders)
- update fragments (for sidedata such as weighted_tags, pageviews related signals)
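A minimal sketch of such a self-carrying model (Python here purely for illustration; the pipeline itself will live on the JVM, and all names below are hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict

class UpdateKind(Enum):
    REVISION = "revision"   # a new revision of the page
    REFRESH = "refresh"     # content re-render, no new revision
    FRAGMENT = "fragment"   # side data such as weighted_tags or pageview signals

@dataclass
class UpdateDocument:
    kind: UpdateKind
    page_id: int
    # arbitrary field values: scalars, arrays, or complex types
    fields: Dict[str, Any] = field(default_factory=dict)
    # per-field hints mirroring super_detect_noop handlers, so the document
    # carries everything needed without fetching target-wiki configuration
    noop_hints: Dict[str, str] = field(default_factory=dict)
```

The important design point is that the noop hints travel with the document rather than being looked up from wiki configuration at apply time.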
The model must support merging all these updates together given a set of rules:
- a revision update can be merged with an update fragment (e.g. a page edit and the corresponding update fragment obtained when ORES does its topic detection)
- two or more update fragments can be combined together
- conflict resolution when e.g. two update fragments attempt to update the same field
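The fragment-merging rules could be sketched as follows (a simplified example; the real conflict-resolution strategy is still to be designed, and last-writer-wins by event time is only one possible rule):

```python
def merge_fragments(base: dict, incoming: dict) -> dict:
    """Combine two update fragments, each a dict with 'fields' and an
    'event_time'. On a field conflict the fragment with the newer
    event_time wins -- one possible resolution rule among several."""
    if incoming["event_time"] >= base["event_time"]:
        newer, older = incoming, base
    else:
        newer, older = base, incoming
    merged = dict(older["fields"])
    merged.update(newer["fields"])  # newer values overwrite conflicting ones
    return {"fields": merged, "event_time": newer["event_time"]}
```

A revision update merged with a fragment would follow the same shape, with the fragment's fields layered on top of the freshly rendered document.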
The model must support arbitrary types for fields (scalars, arrays, and complex types such as GeoData coordinates).
The model must support being serialized by Flink (i.e. implementing a versioned Flink serializer might be wise).
The model must support being serialized to JSON using a dedicated schema that will have to be designed.
The model must support being serialized as an elasticsearch update (super_detect_noop updates and delete operations).
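For the elasticsearch side, an update could be rendered as a scripted update routed through the super_detect_noop script from the wikimedia `extra` plugin. The parameter shape below is illustrative only and must be checked against the plugin's actual API:

```python
def to_elasticsearch_update(page_id: str, fields: dict, noop_hints: dict) -> dict:
    """Render one update as a super_detect_noop scripted update.
    The shape of 'script' and its params is an assumption to verify
    against the wikimedia extra plugin."""
    return {
        "_id": page_id,
        "script": {
            "source": "super_detect_noop",
            "lang": "super_detect_noop",
            "params": {
                "source": fields,        # the field values to apply
                "handlers": noop_hints,  # per-field handlers, e.g. "within 20%"
            },
        },
    }
```

Delete operations would bypass this entirely and emit a plain bulk `delete` action.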
The model must support being constructed out of the response of the CirrusSearch API to render its document (T317309).
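Constructing the model from a CirrusSearch render response might look like the sketch below. The response shape is a simplified, hypothetical example based on the `cirrusbuilddoc` API prop; the real output of T317309 must be inspected before relying on any of these field names:

```python
def from_cirrus_doc(api_response: dict) -> dict:
    """Build a revision-based update out of a (simplified, hypothetical)
    CirrusSearch document-render response."""
    page = api_response["cirrusbuilddoc"]
    return {
        "kind": "revision",
        "page_id": page["page_id"],
        # everything rendered by CirrusSearch becomes a field of the update
        "fields": {k: v for k, v in page.items() if k != "page_id"},
    }
```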
Note: multiple iterations are expected as we implement the search update pipeline itself and discover corner cases we did not anticipate. This ticket is about implementing a reasonable first iteration.
- CirrusSearch does not yet produce a document that perfectly matches the schema defined [[https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/844075/4/airflow/tests/fixtures/hive_operator_hql/import_cirrus_indexes_init_create_tables.expected | here]].
- a first version of the model is defined and implemented