convert_to_esbulk: Ship cirrussearch updates to non-content indices
Closed, ResolvedPublic

Description

Today the CirrusSearch data pipeline can only ship updates to the content indices. Future work with ORES predictions requires the ability to ship to draft namespaces. Essentially, this script will need to source a mapping from namespace to index for all wikis (likely with unlisted namespaces going to the general index), and then generate updates with the appropriate index listed.

Doing this means we will start shipping popularity score updates to non-content indices; previously we allowed those to turn into noops at the indexing stage. To keep the status quo, convert_to_esbulk will also need to throw away the non-content updates for popularity_score.
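A minimal sketch of the two rules described above, not the actual convert_to_esbulk code. The mapping shape (wiki to namespace id to index name), the `<wiki>_content` / `<wiki>_general` naming, and the helper names `resolve_index` and `should_ship` are all assumptions made for illustration.

```python
from typing import Mapping

# Assumed shape: wiki name -> {namespace id -> index name}.
NamespaceMap = Mapping[str, Mapping[int, str]]


def resolve_index(ns_map: NamespaceMap, wiki: str, namespace_id: int) -> str:
    """Pick the index for a page; unlisted namespaces go to the general index."""
    # Assumption: the general index is named "<wiki>_general".
    return ns_map.get(wiki, {}).get(namespace_id, f"{wiki}_general")


def should_ship(field: str, index_name: str, wiki: str) -> bool:
    """Keep the status quo: popularity_score is only shipped to the content index."""
    if field == "popularity_score":
        # Assumption: the content index is named "<wiki>_content".
        return index_name == f"{wiki}_content"
    return True
```

With a mapping such as `{"enwiki": {0: "enwiki_content"}}`, a popularity_score update for a page in an unlisted namespace would resolve to the general index and then be dropped by `should_ship`, while other fields for the same page would still be shipped.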

Event Timeline

Change 607597 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Include concrete namespace mapping in config dump

https://gerrit.wikimedia.org/r/607597

The attached patch is only the first step. Rough plan:

  • Expose a concrete mapping from wiki + namespace to index name, so the appropriate index can be chosen.
  • Ensure namespace is available for all data to ship.
    • Add namespace_id to the popularity_score table; namespace is available in the existing wmf.pageview_hourly input.
    • Add namespace_id to the ores_articletopic table; namespace is available in the existing event.mediawiki_revision_score input.
    • Both tables need to be migrated
  • Write a small script to fetch the concrete wiki + namespace mapping from the production APIs and store it somewhere in HDFS.
    • We could integrate this into the next step, but it seems useful to isolate tasks that reach out to production APIs: we have to provide them with appropriate configuration to access hosts outside the analytics network, and we want to limit the amount of code that can reach outside analytics.
  • Update convert_to_esbulk.py to accept a JSON-formatted file containing the namespace mapping for all wikis, and use that mapping to choose index names in the bulk indexing output.
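
The last step of the plan could look roughly like the sketch below. It assumes the JSON file is an object keyed by wiki, whose values map namespace ids (as strings, since JSON keys are always strings) to index names; the real config dump format may differ. The function names are hypothetical, and the output uses a plain Elasticsearch bulk `update` action with a partial `doc`, which is not necessarily what convert_to_esbulk.py emits.

```python
import json
from typing import Dict, Iterator


def load_namespace_map(text: str) -> Dict[str, Dict[int, str]]:
    """Parse the (assumed) JSON mapping and convert namespace keys to ints."""
    raw = json.loads(text)
    return {
        wiki: {int(ns): index for ns, index in ns_to_index.items()}
        for wiki, ns_to_index in raw.items()
    }


def bulk_update_lines(
    ns_map: Dict[str, Dict[int, str]],
    wiki: str,
    namespace_id: int,
    page_id: int,
    doc: dict,
) -> Iterator[str]:
    """Yield the action line and the document line for one bulk update."""
    # Assumption: unlisted namespaces fall back to "<wiki>_general".
    index = ns_map.get(wiki, {}).get(namespace_id, f"{wiki}_general")
    yield json.dumps({"update": {"_index": index, "_id": str(page_id)}})
    yield json.dumps({"doc": doc})
```

Joining the yielded lines with newlines (plus a trailing newline) gives a body in the newline-delimited format that the Elasticsearch bulk API expects.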
EBernhardson triaged this task as Medium priority.

Change 607597 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Include concrete namespace mapping in config dump

https://gerrit.wikimedia.org/r/607597

Gehel closed this task as Resolved. Mon, Nov 9, 12:54 PM