@diego from the Research team has shared an initial Portuguese Wikipedia dataset on HDFS: /user/dsaez/section_wikidata_item/pt.parquet
Schema (all strings):
- wiki_db: the wiki
- page_wikidata_item: article QID
- page_title: article title
- section_heading: section name
- section_link_item_id: QID of a blue link in the given section, null if none
The blue link relevance score is currently missing.
Look into this dataset and work out how to join it with the image suggestions dataset output by T299789: [XL] Store a list of unillustrated articles with suggested images in hdfs.
Questions we'd like to answer:
- How many sections have images?
- What percentage of sections have images?
Update
@MunizaA shared an additional dataset at user/mnz/section_images/section_images_2022-04.parquet.
It contains existing images of sections in English and Portuguese Wikipedias as of April 2022.
Schema: item_id: string, wiki_db: string, heading: string, images: array<string>
In other words: QID > wiki > section > existing image file names
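Both schemas carry a wiki, an article QID, and a section heading, so one possible way to line them up is a left join on those three fields. The sketch below is plain Python on made-up toy rows, not the actual job: the key mapping (`page_wikidata_item` to `item_id`, `section_heading` to `heading`) is an assumption based only on the column descriptions above, and `join_sections` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy rows mirroring the two schemas described above (all values are made up).
blue_links = [
    {"wiki_db": "ptwiki", "page_wikidata_item": "Q1", "page_title": "A",
     "section_heading": "História", "section_link_item_id": "Q10"},
    {"wiki_db": "ptwiki", "page_wikidata_item": "Q1", "page_title": "A",
     "section_heading": "Geografia", "section_link_item_id": None},
    {"wiki_db": "ptwiki", "page_wikidata_item": "Q2", "page_title": "B",
     "section_heading": "História", "section_link_item_id": "Q11"},
]
section_images = [
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "História",
     "images": ["a.jpg", "b.png"]},
]


def join_sections(blue_link_rows, section_image_rows):
    """Left-join blue-link rows with existing section images.

    Assumed join key: (wiki, article QID, section heading).
    """
    images_by_key = defaultdict(list)
    for row in section_image_rows:
        key = (row["wiki_db"], row["item_id"], row["heading"])
        images_by_key[key].extend(row["images"] or [])

    joined = []
    for row in blue_link_rows:
        key = (row["wiki_db"], row["page_wikidata_item"], row["section_heading"])
        # Sections without a match keep an empty image list.
        joined.append({**row, "images": list(images_by_key.get(key, []))})
    return joined


joined = join_sections(blue_links, section_images)
```

Matching on the raw heading string assumes both datasets normalize section names the same way; if they do not, the headings would need cleaning before the join.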
Based on this dataset, here are the answers:
| wiki | sections with images | percentage |
| --- | --- | --- |
| enwiki | 1,429,943 | ~6% |
| ptwiki | 221,609 | ~9.1% |
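For reference, a per-wiki count and percentage of this kind could be derived roughly as follows. This is only a sketch on made-up toy rows: it assumes the input lists every section, with an empty `images` array for unillustrated ones, which the description above does not confirm (the total number of sections may come from a different source). `image_coverage` is a hypothetical helper name.

```python
from collections import defaultdict


def image_coverage(rows):
    """Return {wiki_db: (sections_with_images, total_sections, percentage)}.

    A section is identified by (item_id, heading) within a wiki.
    """
    all_sections = defaultdict(set)
    illustrated = defaultdict(set)
    for row in rows:
        section = (row["item_id"], row["heading"])
        all_sections[row["wiki_db"]].add(section)
        if row["images"]:  # a missing or empty list counts as unillustrated
            illustrated[row["wiki_db"]].add(section)

    stats = {}
    for wiki, sections in all_sections.items():
        with_images = len(illustrated[wiki])
        percentage = round(100 * with_images / len(sections), 1)
        stats[wiki] = (with_images, len(sections), percentage)
    return stats


# Toy rows in the section_images schema (all values are made up).
toy_rows = [
    {"item_id": "Q1", "wiki_db": "enwiki", "heading": "History", "images": ["a.jpg"]},
    {"item_id": "Q1", "wiki_db": "enwiki", "heading": "Geography", "images": []},
    {"item_id": "Q2", "wiki_db": "enwiki", "heading": "History", "images": []},
    {"item_id": "Q2", "wiki_db": "enwiki", "heading": "Legacy", "images": None},
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "História", "images": ["b.png"]},
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "Geografia", "images": []},
]

coverage = image_coverage(toy_rows)
```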
Investigation
Tracked here: https://gitlab.wikimedia.org/mfossati/section-quids/-/blob/main/visual_qids.ipynb
Conclusion
It looks like we can effectively join the blue-link datasets with the image suggestions datasets and target the section-level image suggestions use case.