
[M] Compute the Wikidata tags update slice for the Commons search index
Closed, Resolved · Public

Description

User story

As a developer, I need to comply with the Search team's indices update process concerning weighted tags.


when updating we will always provide a "full update" (...)
we always provide the full new state and get rid of the old state. (...)
it would be more resilient if the system could generate the expected data for the set of pages it wants to update, emitting __DELETE_GROUPING__ when it has calculated that the page doesn't have any of the specified tags

Essentially, I think what the script should output is not the updates it wants to make to the cirrus indices, but rather the expected final state of the items it wants to update.

To be implemented in T300045 script:

  • Input = previous run output, current run output
  • Output = update dataframe slice with relevant __DELETE_GROUPING__ values

Simplified example
Previous run:

| page_id | tag | values |
| --- | --- | --- |
| 123 | p373 | [ Q1∣25 ] |
| 666 | p18 | [ Q42∣999 ] |

Current run:

| page_id | tag | values |
| --- | --- | --- |
| 666 | p18 | [ Q10000∣27 ] |

Output:

| page_id | tag | values |
| --- | --- | --- |
| 123 | p373 | [ __DELETE_GROUPING__ ] |
| 666 | p18 | [ Q10000∣27 ] |

High-level steps

  1. Append .previous to the previous dataframe's HDFS Parquet path
  2. Read the previous dataframe
  3. Generate the current one
  4. Compute the delta between the two (one option could be a left anti join, but this needs further investigation)
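The steps above can be sketched in plain Python (the production job would operate on Spark dataframes, where a left anti join plays the role of the set difference below; function names and the dict-based representation here are illustrative, not from the actual script):

```python
DELETE = "__DELETE_GROUPING__"

def compute_update_slice(previous, current):
    """previous/current: dicts mapping (page_id, tag) -> list of values.

    Returns the rows to ship: every row from the current run (the expected
    final state), plus a delete sentinel for every (page_id, tag) present
    in the previous run but absent from the current one.
    """
    update_slice = dict(current)
    # Equivalent of a left anti join of previous against current:
    # keys that only exist in the previous run must be explicitly deleted.
    for key in previous.keys() - current.keys():
        update_slice[key] = [DELETE]
    return update_slice

# Simplified example from the task description (ASCII "|" as separator):
previous = {(123, "p373"): ["Q1|25"], (666, "p18"): ["Q42|999"]}
current = {(666, "p18"): ["Q10000|27"]}
print(compute_update_slice(previous, current))
# -> {(666, 'p18'): ['Q10000|27'], (123, 'p373'): ['__DELETE_GROUPING__']}
```

The delete sentinel is only emitted for pages the previous run knew about, which keeps the output a small slice rather than the full 100M-doc state.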

Event Timeline

@EBernhardson , following your feedback in T300045#7708665 and T300045#7708772, am I shaping it right?

Blocked until we find an agreement on the expected output.

CBogen renamed this task from Compute the Wikidata tags update slice for the Commons search index to [M] Compute the Wikidata tags update slice for the Commons search index. Feb 23 2022, 5:33 PM

My understanding of the context:

  • This process generates the expected state for all files in commonswiki, which will be 100M docs soon enough. For some (many?) the expected state is the empty set. The underlying issue being solved is synchronizing the state between this dataset and elasticsearch.
  • Diffing against the previous run is an optimization to avoid sending updates for all documents. Shipping the full state and letting noops figure it out on the elastic side could take a day or so to import.

Thoughts:

  • While we generally ship the expected state, shipping a diff seems appropriate in this context.
  • Within search, updating a small part of the document or the whole document is the same thing. If some updates will be issued to a page, we might as well send all the state we know about the page (unless it's inconvenient to share that knowledge with the right places).
  • Ideally we would have you provide the empty array, instead of the single element [__DELETE_GROUPING__] array, and transform it inside the preparation process to avoid leaking internal details. But our previous work with this intermittently had an issue similar to SPARK-25271. A fix is included in 2.4.8, but the cluster currently has 2.4.4.
  • All inputs must specify wiki, page_id and page_namespace. I suspect wiki and page_namespace are constants here? We could put them into the configuration or as a column in the table, whichever works best.
  • The updater accepts two input shapes: either a row per tag, with the tag specified per-row in a column, or a row per page_id with a configured mapping from column name to tag. Whichever is more convenient to work with can be provided. These would both be valid input formats:
| wiki | page_namespace | page_id | tag | values |
| --- | --- | --- | --- | --- |
| commonswiki | 6 | 123 | lead_image_qid | [Q123∣7] |
| commonswiki | 6 | 123 | p373 | [__DELETE_GROUPING__] |

or

| wiki | page_namespace | page_id | lead_image_qid | p373 |
| --- | --- | --- | --- | --- |
| commonswiki | 6 | 123 | [Q123∣7] | [__DELETE_GROUPING__] |
  • This process generates the expected state for all files in commonswiki, which will be 100M docs soon enough. For some (many?) the expected state is the empty set. The underlying issue being solved is synchronizing the state between this dataset and elasticsearch.

Correct (not all files actually, just images). Given the December 2021 snapshot, we have roughly 32M docs with non-empty sets. We expect this number to grow once a small fix needed at dataset gathering time is applied.

  • Diffing against the previous run is an optimization to avoid sending updates for all documents. Shipping the full state and letting noops figure it out on the elastic side could take a day or so to import.

Agreed. The optimized run would take far less time (on the order of minutes).

  • Ideally we would have you provide the empty array, instead of the single element [__DELETE_GROUPING__] array, and transform it inside the preparation process to avoid leaking internal details. But our previous work with this intermittently had an issue similar to SPARK-25271. A fix is included in 2.4.8, but the cluster currently has 2.4.4.

Interesting, thanks for pointing that out. I don't remember hitting that issue when writing empty arrays to a Parquet file, probably because I've never written completely empty columns.
Anyway, I think we can simply avoid the issue by emitting [__DELETE_GROUPING__].
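That workaround can be sketched as a small normalization step: never write an empty value array, always substitute the single-element sentinel instead (the helper name below is hypothetical, not from the actual script):

```python
DELETE = "__DELETE_GROUPING__"

def normalize_values(values):
    """Replace an empty (or missing) value list with the delete sentinel,
    so the output never contains completely empty arrays -- the situation
    that can trip SPARK-25271-like Parquet issues."""
    return values if values else [DELETE]

print(normalize_values([]))           # -> ['__DELETE_GROUPING__']
print(normalize_values(["Q1|25"]))    # -> ['Q1|25']
```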

  • All inputs must specify wiki, page_id and page_namespace. I suspect wiki and page_namespace are constants here? We could put them into the configuration or as a column in the table, whichever works best.

Correct. It's already implemented as constant columns in the table.

  • The updater accepts two input shapes: either a row per tag, with the tag specified per-row in a column, or a row per page_id with a configured mapping from column name to tag. Whichever is more convenient to work with can be provided. These would both be valid input formats:
| wiki | page_namespace | page_id | tag | values |
| --- | --- | --- | --- | --- |
| commonswiki | 6 | 123 | lead_image_qid | [Q123∣7] |
| commonswiki | 6 | 123 | p373 | [__DELETE_GROUPING__] |

or

| wiki | page_namespace | page_id | lead_image_qid | p373 |
| --- | --- | --- | --- | --- |
| commonswiki | 6 | 123 | [Q123∣7] | [__DELETE_GROUPING__] |

The first shape corresponds to the current output: I'll be providing that.
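For reference, the two shapes are mechanically interchangeable; a pure-Python sketch of turning the wide shape (one column per tag) into the row-per-tag shape being provided (in Spark this would be a stack/explode; the function name and dict representation are illustrative):

```python
def to_row_per_tag(wide_rows, tag_columns):
    """wide_rows: dicts carrying wiki, page_namespace, page_id plus one
    column per tag. Returns one row per (page, tag), skipping tags with
    no values for that page."""
    out = []
    for row in wide_rows:
        for tag in tag_columns:
            values = row.get(tag)
            if values:
                out.append({
                    "wiki": row["wiki"],
                    "page_namespace": row["page_namespace"],
                    "page_id": row["page_id"],
                    "tag": tag,
                    "values": values,
                })
    return out

# Second input shape from the thread (ASCII "|" as separator):
wide = [{"wiki": "commonswiki", "page_namespace": 6, "page_id": 123,
         "lead_image_qid": ["Q123|7"], "p373": ["__DELETE_GROUPING__"]}]
rows = to_row_per_tag(wide, ["lead_image_qid", "p373"])
# -> two rows, one per tag, matching the first input shape
```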

I believe we've reached an agreement, will unblock this task. Thanks again for your thorough feedback, @EBernhardson !

mfossati changed the task status from Open to In Progress.Feb 25 2022, 9:03 AM