
Refactor commons_wikidata_links/gather_data.ipynb notebook as a python script
Closed, Resolved · Public

Description

Use case

As a developer, I need to transform the original notebook used to gather Wikidata data relevant to Commons images into a Python script, so that it can be run via Airflow.


The notebook https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers wikidata data relevant to commons images and writes it to a parquet file.

We need to transform the notebook into a Python script so it can be run via Airflow, as follows:

  • Transform the Jupyter notebook into a script as shown in Airflow_Coding_Convention#Jupyter_Notebooks
  • Remove the use of wmfdata and use Spark directly: Airflow_Coding_Convention#Spark
  • Instead of reading from hdfs:/user/gmodena/image_placeholders, call the script that generates the placeholder images for the airflow user in the data platform team's Airflow DAG before running the notebook script. This changes the code to simply read from image_placeholders
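The conversion above can be sketched as a script skeleton. This is a minimal, illustrative sketch only: the argument names, the app name, and the assumption that image_placeholders is read as a parquet path (rather than a Hive table) are all guesses, not the final pipeline code.

```python
# Sketch of the notebook-to-script conversion (all names here are
# illustrative assumptions, not the final pipeline code).
import argparse


def parse_args(argv=None):
    """CLI arguments so Airflow can parameterise the run."""
    parser = argparse.ArgumentParser(
        description="Gather Wikidata data relevant to Commons images"
    )
    parser.add_argument(
        "--image-placeholders",
        default="image_placeholders",
        help="Input produced by the placeholder-image task (assumed parquet)",
    )
    parser.add_argument("--output", required=True, help="Destination parquet path")
    return parser.parse_args(argv)


def main():
    args = parse_args()
    # Imported locally so the module can be imported without a Spark environment.
    from pyspark.sql import SparkSession

    # Use Spark directly instead of wmfdata, per the Airflow coding conventions.
    spark = SparkSession.builder.appName("commons_wikidata_links").getOrCreate()
    # Assumption: the placeholder data is a parquet path; it may instead be
    # a Hive table, in which case spark.read.table() would be used.
    placeholders = spark.read.parquet(args.image_placeholders)
    # ... notebook logic: gather the Wikidata data and write it out ...
    spark.stop()


if __name__ == "__main__":
    main()
```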

Event Timeline

Cparle updated the task description.
mfossati changed the task status from Open to In Progress. · Feb 2 2022, 10:36 AM

A current output example row is:

| page_id | reverse_p18 | reverse_p373 | lead_image_qids |
| 363592 | [ Q753512 ] | [ Q1488729/199 ] | [ Q1867183/86, Q753512/22 ] |

where:

  • page_id is a Commons image ID
  • reverse_p18 is a list of QIDs
  • the other columns are lists of QID/score pairs (the separator is actually a pipe; a slash is used here for rendering purposes)

For the sake of T299787, we propose the following update to the current output:

| page_id | wiki | namespace | weighted_tag |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p18/Q753512 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p373/Q1488729/199 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid/Q1867183/86 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid/Q753512/22 |

(the score is separated by pipe)
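The wide-to-tall reshaping proposed above can be sketched with a small helper. The column and tag names follow the tables in this task; the helper itself, its name, and the dict-based row representation are illustrative assumptions.

```python
# Hypothetical sketch of reshaping one wide row into per-tag rows.
# Column and tag names come from this task; the helper is illustrative.
TAG_PREFIX = "image.linked.from.wikidata"

COLUMN_TO_TAG = {
    "reverse_p18": f"{TAG_PREFIX}.p18",
    "reverse_p373": f"{TAG_PREFIX}.p373",
    "lead_image_qids": f"{TAG_PREFIX}.lead_image_qid",
}


def to_weighted_tag_rows(row, wiki="commonswiki", namespace=6):
    """Turn one wide row into (page_id, wiki, namespace, weighted_tag) rows."""
    out = []
    for column, tag in COLUMN_TO_TAG.items():
        for value in row.get(column, []):
            # value is either "Qxxx" or "Qxxx|<score>"; the score stays
            # pipe-separated inside the weighted_tag string.
            out.append((row["page_id"], wiki, namespace, f"{tag}/{value}"))
    return out
```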

Note that the final reverse_p373 score is actually computed in a separate script, namely the one that pushes data to Elastic Search, see https://github.com/cormacparle/commons_wikidata_links/blob/main/push_data_to_elastic.py#L50.
We may want to include such computation here.

@mfossati this looks great. If we can still make adjustments, a format along these lines would better fit the current logic we have in the search pipeline:

| page_id (int) | wiki (str) | namespace (int) | tag (str, but almost an enum of 3 values here) | values (array[str]) |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p18 | Q753512∣1 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p373 | Q1488729∣199 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid | Q1867183∣86, Q753512∣22 |

Note that the score must be within the range 1 to 1000.

(beware: used ∣ U+2223 for |)
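A value formatter for this scheme might look like the sketch below. The function name is hypothetical; the format (`<QID>|<score>`, score clamped into [1, 1000]) follows the table and note above.

```python
def format_weighted_value(qid, score):
    """Format one value as '<QID>|<score>', clamping the score into [1, 1000].

    Hypothetical helper; the pipe-separated format and the 1..1000 range
    come from the discussion in this task.
    """
    return f"{qid}|{max(1, min(1000, int(score)))}"
```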

Pinging @EBernhardson who wrote this data pipeline to make sure we don't miss anything important.

@dcausse thanks for the update, that all sounds good! I'll be waiting for @EBernhardson's confirmation before proceeding.

> (beware: used ∣ U+2223 for |)

An alternative pipe, interesting!

Additional note: image.linked.from.wikidata.p18 values should always get a score of 1000, since it's an important tag when available.

The above is very close; only slight column-name changes are necessary. The ingestion code requires a Hive table with four columns (a fifth is optional, for when one table carries multiple tags), and the names of the first three have to be exact. In particular, there is a built-in assumption that the transfer script will assemble and whitelist (per ingestion source) the final set of tags to be shipped. The discovery.mediawiki_revision_recommendation_create table, which is created this way, can also be used as a reference.

  • wikiid - str
  • page_namespace - int
  • page_id - int
  • tag - str - the column name can be anything; its value specifies the tag the values are associated with. If only one tag will be updated, this column can be omitted (not this use case).
  • values - array[str] - the column name can be anything. Values must be pre-formatted with |<weight> suffixes.

One other thought about the update process that might have to be taken into account: weighted tags are updated per tag. When updating a page, all tags matching any of the provided tags are removed and replaced with the new ones; unreferenced tags are not changed. Concretely, updating image.linked.from.wikidata.p18 values will not clear out image.linked.from.wikidata.p373 values on the same page. To remove a tag without replacing it with new values, a special sigil must be provided as the only element: __DELETE_GROUPING__. I'm not sure whether that will be necessary for this use case, but possibly.
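Putting the schema and the delete sigil together, one ingestion row could be built as sketched below. The required column names (wikiid, page_namespace, page_id, tag, values) come from the list above; the function itself is a hypothetical illustration.

```python
# Sigil described above: the only element of `values` when a tag grouping
# should be removed without replacement.
DELETE_GROUPING = "__DELETE_GROUPING__"


def weighted_tag_update(wikiid, page_namespace, page_id, tag, values):
    """Build one ingestion row (hypothetical helper).

    The first three column names must match exactly; an empty values list
    becomes the __DELETE_GROUPING__ sigil so existing tags get cleared.
    """
    return {
        "wikiid": wikiid,
        "page_namespace": page_namespace,
        "page_id": page_id,
        "tag": tag,
        "values": values if values else [DELETE_GROUPING],
    }
```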

Hmmm, ok, so you have no dump-and-reload mechanism? If not, we'll have to keep the data from the previous run in order to work out the __DELETE_GROUPING__ part.

The system was initially designed with the intention that, when updating, we would always provide a "full update" for the data the updating process is interested in. Almost everything in cirrus is designed this way: instead of trying to work out what the old state is versus the new state, we always provide the full new state and discard the old one. I might suggest it would be more resilient if the system could generate the expected data for the set of pages it wants to update, emitting __DELETE_GROUPING__ when it has calculated that a page doesn't have any of the specified tags, although this might run counter to how the data is currently collected.

Essentially, I think what the script should output is not the updates it wants to make to the cirrus indices, but rather the expected final state of the items it wants to update.

The code base will live in the new repo. Closing this now; will create a merge request for code review there.