Add an image: generate static file of suggestions
Open, Needs Triage, Public

Description

The first iteration of "add an image" as built by the Growth team will operate off a static file of image suggestions that will be generated once.

Requirements

  • Users have access to all the suggestions (i.e. unillustrated articles with image matches) currently available via the Image Matching Algorithm.
  • Suggestions via MediaSearch should be excluded.
  • If a given article has multiple candidate image suggestions, they should all be available.
  • Articles that fall into the following groups should be filtered out, using the filters already developed in T276137 (Exclude unillustrated articles that should not have images):
    • Disambiguation pages
    • Years
    • Lists
    • Redirects
  • Suggestions will not need to be regenerated or updated based on new images in Commons or new data in Wikidata; a static set of suggestions will suffice for Iteration 1.
  • The Growth team will be prioritizing the following wikis, but would prefer to load data for all Wikipedias if that is trivial:
    • Arabic
    • Czech
    • Vietnamese
    • Bengali
    • Spanish
    • Portuguese
    • Persian
    • Turkish
  • After it is generated, the file should be loaded into Hadoop so that the Search team can pick it up to complete T285817: Add an image: load static file to search index. At a minimum, the table needs to contain:
    • wiki
    • page_id
    • namespace
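
As a rough illustration of the requirements above, the following sketch drops the excluded page types and MediaSearch suggestions, keeps every remaining candidate per article, and projects to the minimal columns (wiki, page_id, namespace). The `Candidate` type, the `page_kind` and `source` values, and the function names are hypothetical and only serve the example; the actual filters are the ones developed in T276137.

```python
from dataclasses import dataclass

# Hypothetical record shape; the field values below are assumptions,
# not the real output of the Image Matching Algorithm.
@dataclass(frozen=True)
class Candidate:
    wiki: str
    page_id: int
    namespace: int
    image_id: str
    source: str      # assumed labels, e.g. "wikidata", "mediasearch"
    page_kind: str   # assumed labels: "article", "disambiguation", "year", "list", "redirect"

EXCLUDED_KINDS = {"disambiguation", "year", "list", "redirect"}
EXCLUDED_SOURCES = {"mediasearch"}  # assumed label for MediaSearch suggestions

def keep(candidate: Candidate) -> bool:
    """True if the candidate survives both the page-type and the source filter."""
    return (candidate.page_kind not in EXCLUDED_KINDS
            and candidate.source not in EXCLUDED_SOURCES)

def minimal_rows(candidates):
    """Distinct (wiki, page_id, namespace) tuples for pages with at least one kept candidate."""
    return sorted({(c.wiki, c.page_id, c.namespace) for c in candidates if keep(c)})
```

Note that `keep` filters per candidate, so an article with several image matches keeps all of them; only `minimal_rows` collapses them to one row per page.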

Timing

The Growth team would like this file to be generated (and indexed) close to August 17, so that the data is recent but still available in Search early enough for the team to develop and test against it.

Event Timeline

@MMiller_WMF - is there any indication of how many articles in the set will have multiple image suggestions? If it is a relatively high proportion, we may want to design the structured task flow with the flexibility to show multiple options per suggestion to choose from, but I was under the impression that for V1 it would be 1 image suggestion per article.

@MMiller_WMF @Tgr Can you specify what you need for namespace? The namespace is 0 (main namespace), given that the algorithm only looks at pages in the main namespace. If this suffices, let me know.

The current export schema is as follows:

  • page_id
  • page_title
  • image_id
  • confidence_rating
  • source
  • dataset_id
  • insertion_time
  • wiki
  • found_on

Reference: https://github.com/mirrys/ImageMatching/blob/main/ddl/export_prod_data.hql#L28
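
To put a number on the multiple-suggestions question raised earlier, the share of articles with more than one candidate image could be estimated from rows in the export schema roughly as below. This is only a sketch: it assumes each row is available as a dict keyed by the column names listed above, and the function name is made up for the example.

```python
def multi_suggestion_share(rows):
    """Fraction of (wiki, page_id) pairs that have more than one distinct image_id.

    `rows` is an iterable of dicts with at least 'wiki', 'page_id' and 'image_id'.
    """
    images_per_page = {}
    for row in rows:
        key = (row["wiki"], row["page_id"])
        images_per_page.setdefault(key, set()).add(row["image_id"])
    if not images_per_page:
        return 0.0
    multi = sum(1 for ids in images_per_page.values() if len(ids) > 1)
    return multi / len(images_per_page)
```

Counting distinct `image_id` values per page, rather than rows, avoids overcounting if the same image is reported by more than one source.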

I think the namespace is the namespace ID (so, yes, 0) but @dcausse is the better person to ask.

@RHo -- we can ask Miriam for that number, but given that we're unsure whether we'll incorporate the multiple suggestions into our experience, we wanted to err on the side of including them in the API.

The namespace is the page namespace (indeed 0 for the main namespace). Even if it is always 0, it would be better to still include it as a column in the file, to avoid hardcoding this default in multiple places.

Hi @Zbyszko @dcausse,

I have a couple of questions re integration:

  • Do you have a preferred HDFS path where data should be delivered?
  • Do you have a preferred output format? Would Parquet work?
  • How do you plan to read this data? Generally we partition by wiki and date (e.g. wiki=enwiki/d=2021-07-13/somefile.parquet). Is your ingestion partition-aware, or would you rather receive one single dataset? Are there any assumptions/preconditions we should account for?
  • We are still building up scheduling & orchestration capabilities for this pipeline, so on our end we are a bit limited. What would be the best way to coordinate an integration test?

Happy to discuss further.
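
For reference, the partition layout mentioned above (wiki=enwiki/d=2021-07-13/somefile.parquet) can be built and parsed along these lines. This is a minimal sketch; the base path in the usage comment and the helper names are placeholders, not an agreed delivery location.

```python
from datetime import date

def partition_path(base: str, wiki: str, day: date, filename: str) -> str:
    """Build a path of the form <base>/wiki=<wiki>/d=<YYYY-MM-DD>/<filename>."""
    return f"{base.rstrip('/')}/wiki={wiki}/d={day.isoformat()}/{filename}"

def parse_partition(path: str):
    """Recover (wiki, date) from the key=value segments of a partitioned path."""
    parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return parts["wiki"], date.fromisoformat(parts["d"])

# Example with a placeholder base directory:
# partition_path("/some/base", "enwiki", date(2021, 7, 13), "somefile.parquet")
```

A partition-aware reader would only need the `wiki` and `d` values, which is what `parse_partition` extracts.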

@gmodena I think we have two ways to inject data into the pipeline:

  • Use event-gate and propagate the data with a schema like https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/revision/recommendation-create or similar. The mediawiki.revision.recommendation-create topic is already consumed by the search data pipeline, but I doubt you'll have all the data it requires, as it is designed from MW schema fragments. Creating a new, simpler schema/topic is possible, but perhaps neither worth the effort nor very interesting, since the data you have is already in a place that we should be able to read.
  • Use a Hive table with a schema similar to what we already read (wikiid, page_id, page_namespace, recommendation_type, plus a date partition) to inject into the search indices. An unscheduled Airflow DAG could be created to fetch a specific partition that you would have populated. Making this more regular would then just be a matter of scheduling this DAG.

Since the data already sits in HDFS, I believe it makes more sense for the search data pipeline to read a table that you would populate. If you were to move this process out of the analytics network, then the event approach might make more sense.
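
To make the table option concrete, mapping export rows onto the columns named above (wikiid, page_id, page_namespace, recommendation_type, plus a date partition) might look like the sketch below. The `"image"` value for recommendation_type, the `date` key used for the partition, and the function name are assumptions for illustration; the real values would be whatever the search data pipeline expects.

```python
MAIN_NAMESPACE = 0  # single place for the default, since the algorithm only covers namespace 0

def to_search_rows(export_rows, snapshot, recommendation_type="image"):
    """Collapse export rows to one row per (wiki, page_id) in the search table shape.

    `export_rows` is an iterable of dicts with at least 'wiki' and 'page_id';
    `snapshot` is the value of the date partition, e.g. "2021-08-17".
    """
    seen = set()
    out = []
    for row in export_rows:
        key = (row["wiki"], row["page_id"])
        if key in seen:
            continue  # several candidate images still mean a single flag per page
        seen.add(key)
        out.append({
            "wikiid": row["wiki"],
            "page_id": row["page_id"],
            "page_namespace": row.get("namespace", MAIN_NAMESPACE),
            "recommendation_type": recommendation_type,  # assumed value
            "date": snapshot,                            # assumed partition column name
        })
    return out
```

Carrying `page_namespace` explicitly, even though it is always 0 today, keeps the default out of downstream consumers.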

@sdkim and I talked about when this is needed, and we think that generating this in mid-August, say August 17, will work fine for the Platform team. It would also potentially be possible to re-generate closer to Growth's deployment near the end of September.

@MMiller_WMF @MPhamWMF Sample data generated with the Hive table schema proposed above (wikiid, page_id, page_namespace, recommendation_type, plus a date partition) can be found under clarakosi.search_imagerec. Please let me know if you have any issues accessing it or if changes are needed to the Hive table.

This looks great, thanks! I'll work up the config and verify that it all works as expected against this dataset.

Did a test run; this looks to fit in and process as expected. Once we have a final dataset we should be able to ship it.

Nice! The final dataset is now available at clarakosi.search_imagerec. Let me know if you run into any issues.

We've also updated the API with the most recent data.

Change 713554 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Configuration for imagerec data shipping

https://gerrit.wikimedia.org/r/713554

Change 713554 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Configuration for imagerec data shipping

https://gerrit.wikimedia.org/r/713554

Change 713706 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Fully enable imagerec data shipping

https://gerrit.wikimedia.org/r/713706

Change 713706 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Fully enable imagerec data shipping

https://gerrit.wikimedia.org/r/713706

Everything has shipped; recommendations are now available.

Thanks @EBernhardson! The results look good at a glance, although the associated API doesn't seem to support huwiki ("Unable to find a wikiId for property wikipedia and language hu"). That is not necessarily a problem, as it won't be used for iteration 1.

Change 713868 had a related patch set uploaded (by Nikki Nikkhoui; author: Nikki Nikkhoui):

[mediawiki/services/image-suggestion-api@master] Add more supported languages

https://gerrit.wikimedia.org/r/713868

Change 713868 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Add more supported languages

https://gerrit.wikimedia.org/r/713868

The API has been updated with more languages. The complete list of language codes now available is:

  • ar
  • ceb
  • en
  • pt
  • he
  • ru
  • tr
  • bn
  • de
  • uk
  • cs
  • fr
  • vi
  • arz
  • es
  • eu
  • fa
  • hu
  • hy
  • ko
  • pl
  • it

If performance becomes a problem, please let us know and we can remove some unused languages! (The massive in-memory DB causes a little latency.)
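
For anyone checking a language against this list, a lookup along the following lines would surface unsupported codes early. It is only a sketch and not how the API resolves wiki IDs internally: the `<lang>wiki` naming (with hyphens turned into underscores) is the usual Wikipedia database naming convention and has exceptions, and the function and constant names are hypothetical.

```python
# Language codes from the list above, with duplicates removed.
SUPPORTED_LANGUAGES = (
    "ar", "ceb", "en", "pt", "he", "ru", "tr", "bn", "de", "uk", "cs",
    "fr", "vi", "arz", "es", "eu", "fa", "hu", "hy", "ko", "pl", "it",
)

def wiki_id(lang: str) -> str:
    """Return the assumed wiki ID for a supported language code, e.g. 'hu' -> 'huwiki'."""
    if lang not in SUPPORTED_LANGUAGES:
        raise KeyError(f"unsupported language: {lang}")
    # Assumption: Wikipedia wiki IDs follow '<lang>wiki'; some wikis deviate from this.
    return lang.replace("-", "_") + "wiki"
```

Trimming unused languages, if latency becomes an issue, would then amount to shortening `SUPPORTED_LANGUAGES` in this sketch.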