The existing image suggestions data pipeline suggests images at the article level. A new data pipeline is being built that will suggest images at the article-section level.
The output of the new data pipeline is expected to be the same as the article-level suggestions, with an additional field containing the section identifier.
Once the new data pipeline is built, the existing article-level data pipeline will continue to run and its output will be consumed as it is today.
Write Frequency and Method:
- Weekly bulk from Airflow > Cassandra connector job
Size and Growth:
- Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
- Size is not known yet but realistically would be a multiple of the existing page level image suggestions (10x?)
Access:
- Cassandra Data Gateway
Timeline:
- SD writing data to production by end of year/early Jan 23.
Questions for Data Persistence:
- Based on the above does it make sense to just modify the existing schema to add a new field? I don't think we need new tables.
- I don't think the new section id field needs to be part of the key. Requests would still be on wiki/page_id (requestors wouldn't likely know the section id; SD, correct me if I'm wrong?)
- Growth may need to add a filter on items with a section id
- Do we have a Cassandra test instance that SD engineers could use during their development phases?
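To illustrate the filtering question above, here is a minimal client-side sketch. It assumes rows are still fetched by wiki/page_id and that the new field is simply empty for article-level suggestions. The field name `section_id` and both helper functions are hypothetical placeholders, not part of any existing schema or API.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical row shape: a plain dict per suggestion, as returned for one wiki/page_id.
Row = Dict[str, Any]


def article_level(rows: Iterable[Row]) -> List[Row]:
    """Keep only suggestions that carry no section id (assumed article-level)."""
    return [row for row in rows if row.get("section_id") is None]


def section_level(rows: Iterable[Row]) -> List[Row]:
    """Keep only suggestions that carry a section id (assumed section-level)."""
    return [row for row in rows if row.get("section_id") is not None]
```

If the section id stays out of the key, a filter like this would run on the consumer side after the wiki/page_id lookup.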
Update
This is an example row currently available in Cassandra for article-level image suggestions:
 wiki   | page_id | id                                   | image                  | confidence | found_on | kind                        | origin_wiki | page_rev
--------+---------+--------------------------------------+------------------------+------------+----------+-----------------------------+-------------+----------
 anwiki |    3326 | 839c1112-97ae-11ed-89f7-bc97e1581854 | 14_Agosto_2016_(1).jpg |         80 |     null | {'istype-commons-category'} | commonswiki |  1853430
For Section-Level-Image-Suggestions, we agreed to add the following fields:
- page_qid: string - the page Wikidata ID
- section_index: int - the numeric index as seen by mwparserfromhell
- section_title: string - the section title extracted from wikitext, e.g., Discography. The special string ### zero ### is reserved for lead sections.
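As a rough sketch of what a section-level row could look like with the three new fields, assuming the existing columns are kept unchanged: the class name, the Python types chosen for the existing columns, and the helper method are assumptions for illustration only, not the agreed schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Reserved title for lead sections, as described above.
LEAD_SECTION_TITLE = "### zero ###"


@dataclass(frozen=True)
class SectionImageSuggestion:
    # Existing article-level columns (Python types are assumptions).
    wiki: str
    page_id: int
    id: str
    image: str
    confidence: int
    found_on: Optional[FrozenSet[str]]
    kind: FrozenSet[str]
    origin_wiki: str
    page_rev: int
    # New section-level fields.
    page_qid: str        # the page Wikidata ID
    section_index: int   # numeric index as seen by mwparserfromhell
    section_title: str   # title extracted from wikitext

    def is_lead_section(self) -> bool:
        """True when the row uses the reserved lead-section title."""
        return self.section_title == LEAD_SECTION_TITLE
```

For example, the row shown earlier could be extended like this, where the `page_qid` and `section_index` values are made-up placeholders:

```python
row = SectionImageSuggestion(
    wiki="anwiki", page_id=3326,
    id="839c1112-97ae-11ed-89f7-bc97e1581854",
    image="14_Agosto_2016_(1).jpg", confidence=80, found_on=None,
    kind=frozenset({"istype-commons-category"}),
    origin_wiki="commonswiki", page_rev=1853430,
    page_qid="Q0",            # placeholder, not a real lookup
    section_index=2,          # placeholder
    section_title="Discography",
)
```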