Push wikidata data to the commonswiki_file search index
Closed, Declined (Public)

Description

User story

As a user of Image Suggestions, I want to get an image suggestion (with a confidence score) for an article with a particular wikidata id. To make this possible, the commonswiki_file search index needs to link wikidata ids to the commons files relevant to them.


https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers the wikidata items relevant for commons search and stores them

https://github.com/cormacparle/commons_wikidata_links/blob/main/push_data_to_elastic.py pushes that data to elasticsearch, but it was written as a once-off and doesn't handle updates to existing data

This ticket is to replicate what push_data_to_elastic.py does, but instead of updating elasticsearch over http we want to write the data to a Hive table and then use Search's airflow process to update the commonswiki_file index. The best place to do this is probably the refactored script from T300045.

Two other things to note while working on this ticket:

  • assuming the wikidata collection notebook is modified as described in T300045, we can safely drop the hardcoded user info, changing 'hdfs:/user/cparle/commons_files_related_wikidata_items' to simply 'commons_files_related_wikidata_items', as the data will live under the platform Airflow username
  • the script uses spark.read.load() to pull data down locally and run computations on it, which the data platform team strongly advises against. Use spark's user defined functions instead (see Airflow_Coding_Convention#Computation)
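
The two notes above, together with the Hive-table approach, could be sketched roughly as follows. This is only an illustration under assumptions, not what push_data_to_elastic.py or the T300045 script actually does: the column names (page_id, related_items), the (item id, score) layout of each entry, the 'Q42|850' output format, and the helper names are all hypothetical.

```python
from typing import List, Optional, Tuple


def format_related_items(items: Optional[List[Tuple[str, int]]]) -> List[str]:
    """Turn (wikidata_id, score) pairs into 'Q42|850' strings (hypothetical format)."""
    if not items:
        return []
    # Highest score first; ties are broken by item id so the output is deterministic.
    ordered = sorted(items, key=lambda pair: (-pair[1], pair[0]))
    return [f"{item_id}|{score}" for item_id, score in ordered]


def write_related_items(spark, source_path: str, target_table: str) -> None:
    """Read the gathered data, transform it with a UDF, and save it to a Hive table."""
    # Imported here so the pure-Python helper above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Wrapping the helper as a UDF runs it on the executors, not on the local machine.
    format_udf = F.udf(format_related_items, ArrayType(StringType()))

    # Assumed input columns: page_id, related_items (array of (item id, score) structs).
    df = spark.read.load(source_path)  # stays a distributed DataFrame; no collect()
    (
        df.select(
            "page_id",
            format_udf(F.col("related_items")).alias("related_wikidata_items"),
        )
        .write.mode("overwrite")
        .saveAsTable(target_table)  # target table name is supplied by the caller
    )
```

A call might look like write_related_items(spark, 'commons_files_related_wikidata_items', target_table), where the relative path is left to resolve under the home directory of whichever user runs the job and target_table is whatever Hive table the Search airflow process is configured to read.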

Event Timeline

Cparle renamed this task from Update script that takes pushes wikidata to the commonswiki search index to Update script that pushes wikidata to the commonswiki search index. Jan 21 2022, 5:43 PM
Cparle created this task.
Cparle renamed this task from Update script that pushes wikidata to the commonswiki search index to Push wikidata data to the commonswiki_file search index. Jan 25 2022, 2:43 PM
Cparle updated the task description.
Cparle updated the task description.
Cparle added subscribers: EBernhardson, Clarakosi.
Cparle updated the task description.
CBogen subscribed.

Doing T302095 instead