
[L] Look into preliminary blue links algorithm data for section topics
Closed, Resolved · Public

Description

@diego from the Research team has shared an initial Portuguese Wikipedia dataset on HDFS: /user/dsaez/section_wikidata_item/pt.parquet

Schema (all strings):

  • wiki_db: the wiki
  • page_wikidata_item: article QID
  • page_title: article title
  • section_heading: section name
  • section_link_item_id: QID of a blue link in the given section, null if none

The blue link relevance score is currently missing.

Look into this dataset and understand how to join it with the image suggestions dataset output by T299789: [XL] Store a list of unillustrated articles with suggested images in hdfs.

Questions we'd like to answer:

  • How many sections have images?
  • What percentage of sections have images?

Update

@MunizaA shared an additional dataset at user/mnz/section_images/section_images_2022-04.parquet.
It contains existing images of sections in English and Portuguese Wikipedias as of April 2022.

Schema: item_id: string, wiki_db: string, heading: string, images: array<string>
In other words: QID > wiki > section > existing image file names
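The two schemas share a natural key: wiki, article QID, and section heading. Below is a minimal sketch of a left join on that key in plain Python, over in-memory rows. The sample rows, QIDs, headings, and file names are made up for illustration, and `join_sections` is a hypothetical helper, not code from the investigation notebook; the real datasets are parquet files on HDFS and would be read with a dataframe engine instead.

```python
from collections import defaultdict

# Toy rows following the blue links schema (values are invented placeholders).
blue_links = [
    {"wiki_db": "ptwiki", "page_wikidata_item": "Q1", "page_title": "Article A",
     "section_heading": "Section A", "section_link_item_id": "Q10"},
    {"wiki_db": "ptwiki", "page_wikidata_item": "Q1", "page_title": "Article A",
     "section_heading": "Section B", "section_link_item_id": None},
]

# Toy rows following the section images schema (values are invented placeholders).
section_images = [
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "Section A",
     "images": ["Example.jpg"]},
]


def join_sections(blue_links, section_images):
    """Left-join blue link rows with existing section images.

    Join key (an assumption based on the two schemas):
    (wiki_db, page_wikidata_item == item_id, section_heading == heading).
    """
    images_by_section = defaultdict(list)
    for row in section_images:
        key = (row["wiki_db"], row["item_id"], row["heading"])
        images_by_section[key].extend(row["images"])

    joined = []
    for row in blue_links:
        key = (row["wiki_db"], row["page_wikidata_item"], row["section_heading"])
        # Sections without a match keep an empty image list.
        joined.append({**row, "images": list(images_by_section.get(key, []))})
    return joined


joined = join_sections(blue_links, section_images)
```

Matching on the raw heading string assumes both datasets normalize section names the same way; if they do not, the key would need a normalization step first.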

Based on this dataset, here are the answers:

wiki    | # sections with images | percentage
enwiki  | 1,429,943              | ~6 %
ptwiki  | 221,609                | ~9.1 %
NOTE: distinct value counts differ from absolute ones. This may be due to sections with identical names.
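To illustrate how absolute and distinct counts can diverge, here is a small sketch in plain Python. It assumes, for illustration only, that the input has one row per section and that sections without images carry an empty `images` list; the actual dataset may be shaped differently. `count_sections_with_images` and the sample rows are hypothetical.

```python
from collections import Counter


def count_sections_with_images(rows):
    """Per wiki: absolute and distinct counts of sections with images,
    plus the percentage of rows that have at least one image."""
    total = Counter()
    with_images = Counter()
    distinct = {}
    for row in rows:
        wiki = row["wiki_db"]
        total[wiki] += 1
        if row["images"]:
            with_images[wiki] += 1
            # Two sections with the same heading in the same article
            # collapse into a single (item_id, heading) pair here.
            distinct.setdefault(wiki, set()).add((row["item_id"], row["heading"]))
    return {
        wiki: {
            "absolute": with_images[wiki],
            "distinct": len(distinct.get(wiki, set())),
            "percentage": 100.0 * with_images[wiki] / total[wiki],
        }
        for wiki in total
    }


# Invented sample: article Q1 has two sections that share a heading.
rows = [
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "Section A", "images": ["A.jpg"]},
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "Section A", "images": ["B.jpg"]},
    {"item_id": "Q1", "wiki_db": "ptwiki", "heading": "Section B", "images": []},
    {"item_id": "Q2", "wiki_db": "ptwiki", "heading": "Section A", "images": []},
]

stats = count_sections_with_images(rows)
```

In this sample, two rows have images but they share one (QID, heading) pair, so the absolute count is 2 while the distinct count is 1.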

Investigation

Tracked here: https://gitlab.wikimedia.org/mfossati/section-quids/-/blob/main/visual_qids.ipynb

Conclusion

It looks like we can effectively join the blue links dataset with the image suggestions one, and so target the section-level image suggestions use case.

Event Timeline

CBogen renamed this task from Look into preliminary blue links algorithm data for section topics to [L] Look into preliminary blue links algorithm data for section topics. (May 4 2022, 4:59 PM)
mfossati changed the task status from Open to In Progress. (May 23 2022, 10:09 AM)

The investigation is done. Moving to code review; I'll discuss the next steps with @diego.

Closing this; next steps were discussed during sync meetings with Research.