
Refactor commons_wikidata_links/gather_data.ipynb notebook as a python script
Closed, Resolved · Public

Description

Use case

As a developer, I need to transform the original notebook used to gather Wikidata data relevant to Commons images into a Python script, so that it can be run via Airflow.


The notebook https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers wikidata data relevant to commons images and writes it to a parquet file.

We need to transform the notebook into a Python script so it can be run via Airflow, as follows:

  • Transform the Jupyter notebook into a script as shown in Airflow_Coding_Convention#Jupyter_Notebooks
  • Remove the use of wmfdata and use Spark directly: Airflow_Coding_Convention#Spark
  • Instead of reading from hdfs:/user/gmodena/image_placeholders, call the script that generates the placeholder images for the airflow user in the data platform team's Airflow DAG before running the notebook script. This changes the code to simply read from image_placeholders
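The conversion above can be sketched as a script skeleton. This is a minimal, illustrative sketch only: the argument names, the app name, and the assumption that image_placeholders is read as a parquet path (rather than a Hive table) are all guesses, not the final pipeline code.

```python
# Sketch of the notebook-to-script conversion (all names here are
# illustrative assumptions, not the final pipeline code).
import argparse


def parse_args(argv=None):
    """CLI arguments so Airflow can parameterise the run."""
    parser = argparse.ArgumentParser(
        description="Gather Wikidata data relevant to Commons images"
    )
    parser.add_argument(
        "--image-placeholders",
        default="image_placeholders",
        help="Input produced by the placeholder-image task (assumed parquet)",
    )
    parser.add_argument("--output", required=True, help="Destination parquet path")
    return parser.parse_args(argv)


def main():
    args = parse_args()
    # Imported locally so the module can be imported without a Spark environment.
    from pyspark.sql import SparkSession

    # Use Spark directly instead of wmfdata, per the Airflow coding conventions.
    spark = SparkSession.builder.appName("commons_wikidata_links").getOrCreate()
    # Assumption: the placeholder data is a parquet path; it may instead be
    # a Hive table, in which case spark.read.table() would be used.
    placeholders = spark.read.parquet(args.image_placeholders)
    # ... notebook logic: gather the Wikidata data and write it out ...
    spark.stop()


if __name__ == "__main__":
    main()
```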

Event Timeline

Cparle updated the task description.
mfossati changed the task status from Open to In Progress. · Feb 2 2022, 10:36 AM

A current output example row is:

| page_id | reverse_p18 | reverse_p373 | lead_image_qids |
| 363592 | [ Q753512 ] | [ Q1488729/199 ] | [ Q1867183/86, Q753512/22 ] |

where:

  • page_id is a Commons image ID
  • reverse_p18 is a list of QIDs
  • the other columns are lists of QID/score pairs (the separator is actually a pipe; a slash is used here for rendering purposes)

For the sake of T299787, we propose the following update to the current output:

| page_id | wiki | namespace | weighted_tag |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p18/Q753512 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p373/Q1488729/199 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid/Q1867183/86 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid/Q753512/22 |

(the score is separated by pipe)
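The wide-to-tall reshaping proposed above can be sketched with a small helper. The column and tag names follow the tables in this task; the helper itself, its name, and the dict-based row representation are illustrative assumptions.

```python
# Hypothetical sketch of reshaping one wide row into per-tag rows.
# Column and tag names come from this task; the helper is illustrative.
TAG_PREFIX = "image.linked.from.wikidata"

COLUMN_TO_TAG = {
    "reverse_p18": f"{TAG_PREFIX}.p18",
    "reverse_p373": f"{TAG_PREFIX}.p373",
    "lead_image_qids": f"{TAG_PREFIX}.lead_image_qid",
}


def to_weighted_tag_rows(row, wiki="commonswiki", namespace=6):
    """Turn one wide row into (page_id, wiki, namespace, weighted_tag) rows."""
    out = []
    for column, tag in COLUMN_TO_TAG.items():
        for value in row.get(column, []):
            # value is either "Qxxx" or "Qxxx|<score>"; the score stays
            # pipe-separated inside the weighted_tag string.
            out.append((row["page_id"], wiki, namespace, f"{tag}/{value}"))
    return out
```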

Note that the final reverse_p373 score is actually computed in a separate script, namely the one that pushes data to Elastic Search, see https://github.com/cormacparle/commons_wikidata_links/blob/main/push_data_to_elastic.py#L50.
We may want to include such computation here.

@mfossati this looks great. If we can still make adjustments, a format along these lines would better fit the current logic we have in the search pipeline:

| page_id (int) | wiki (str) | namespace (int) | tag (str, but almost an enum of 3 values here) | values (array[str]) |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p18 | Q753512∣1 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.p373 | Q1488729∣199 |
| 363592 | commonswiki | 6 | image.linked.from.wikidata.lead_image_qid | Q1867183∣86, Q753512∣22 |

Note that the score must be within the range 1 to 1000.

(beware: used ∣ U+2223 for |)
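A value formatter for this scheme might look like the sketch below. The function name is hypothetical; the format (`<QID>|<score>`, score clamped into [1, 1000]) follows the table and note above.

```python
def format_weighted_value(qid, score):
    """Format one value as '<QID>|<score>', clamping the score into [1, 1000].

    Hypothetical helper; the pipe-separated format and the 1..1000 range
    come from the discussion in this task.
    """
    return f"{qid}|{max(1, min(1000, int(score)))}"
```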

Pinging @EBernhardson who wrote this data pipeline to make sure we don't miss anything important.

@dcausse thanks for the update, that all sounds good! I'll be waiting for @EBernhardson's confirmation before proceeding.

> (beware: used ∣ U+2223 for |)

An alternative pipe, interesting!

Additional note: image.linked.from.wikidata.p18 values should always get a score of 1000, since it's an important tag when available.

The above is very close; only slight column-name changes are necessary. The ingestion code requires a Hive table with four columns (a fifth is optional, for when one table carries multiple tags), and the names of the first three have to be exact. In particular, there is a built-in assumption that the transfer script will assemble and whitelist (per ingestion source) the final set of tags to be shipped. The discovery.mediawiki_revision_recommendation_create table, which is created this way, can also be used as a reference.

  • wikiid - str
  • page_namespace - int
  • page_id - int
  • tag - str - the column name can be anything; its value specifies the tag the values are associated with. If only one tag will be updated, this column can be omitted (not this use case).
  • values - array[str] - the column name can be anything. Values must be pre-formatted with |<weight> suffixes.

One other thought about the update process that might have to be taken into account: weighted tags are updated per tag. When updating a page, all tags matching any of the provided tags are removed and replaced with the new ones; unreferenced tags are not changed. Concretely, updating image.linked.from.wikidata.p18 values will not clear out image.linked.from.wikidata.p373 values on the same page. To remove a tag without replacing it with new values, a special sigil must be provided as the only element: __DELETE_GROUPING__. I'm not sure whether that will be necessary for this use case, but possibly.
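Putting the schema and the delete sigil together, one ingestion row could be built as sketched below. The required column names (wikiid, page_namespace, page_id, tag, values) come from the list above; the function itself is a hypothetical illustration.

```python
# Sigil described above: the only element of `values` when a tag grouping
# should be removed without replacement.
DELETE_GROUPING = "__DELETE_GROUPING__"


def weighted_tag_update(wikiid, page_namespace, page_id, tag, values):
    """Build one ingestion row (hypothetical helper).

    The first three column names must match exactly; an empty values list
    becomes the __DELETE_GROUPING__ sigil so existing tags get cleared.
    """
    return {
        "wikiid": wikiid,
        "page_namespace": page_namespace,
        "page_id": page_id,
        "tag": tag,
        "values": values if values else [DELETE_GROUPING],
    }
```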

Hmmm, ok, so you have no dump-and-reload mechanism? If not, we'll have to keep the data from the previous run in order to work out the __DELETE_GROUPING__ part.

The system was initially designed with the intention that, when updating, we would always provide a "full update" for the data the updating process is interested in. Almost everything in cirrus is designed this way: instead of trying to work out what the old state is versus the new state, we always provide the full new state and discard the old one. I might suggest it would be more resilient if the system could generate the expected data for the set of pages it wants to update, emitting __DELETE_GROUPING__ when it has calculated that a page doesn't have any of the specified tags, although this might run counter to how the data is currently collected.

Essentially, I think what the script should output is not the updates it wants to make to the cirrus indices, but rather the expected final state of the items it wants to update.

The code base will live in the new repo. Closing this now; will create a merge request for code review there.