[ImgRec UCU] Data Collection
Closed, Resolved, Public

Description

Collect a dataset of articles and their corresponding images from Wikipedia's featured articles:

  • Download the text of each article
  • Download the images from each article
  • Upload the dataset to Kaggle
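
The first two steps can be sketched as below. This is a minimal illustration, not the code used for this task: the task itself relies on Pywikibot, whereas this sketch swaps in the MediaWiki Action API directly with only the Python standard library. The endpoint (English Wikipedia) and the helper names `build_query_url` and `parse_page` are assumptions introduced for the example.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; the task does not name the wiki explicitly.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str) -> str:
    """Build an Action API URL asking for one page's plain text and image titles."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts|images",  # extracts comes from the TextExtracts extension
        "explaintext": 1,           # plain text instead of HTML
        "imlimit": "max",           # long articles may still need continuation
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_page(response: dict) -> dict:
    """Pull the title, text and image file titles out of a decoded JSON response."""
    # With format=json the pages are keyed by page ID; one title gives one page.
    page = next(iter(response["query"]["pages"].values()))
    return {
        "title": page["title"],
        "text": page.get("extract", ""),
        "images": [img["title"] for img in page.get("images", [])],
    }
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the body with `json.loads` gives the dictionary that `parse_page` expects. The image titles it returns are file page names, so downloading the actual files is a separate step.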

Follow up tasks:

  • Download the "problematic" images with the dev version of Pywikibot, since the PyPI version has a defect (T236405)

Event Timeline

Sent an email about a Pywikibot bug found on some articles (it crashes and cannot download some images)

FYI, Pywikibot uses GitHub only as a mirror. Wikimedia Phabricator is the issue tracker. See https://www.mediawiki.org/wiki/Manual:Pywikibot and https://www.mediawiki.org/wiki/How_to_report_a_bug

@Aklapper, thank you for the clarification! Will report it properly

Created a truncated dataset of 500 articles (1.3 GB): https://www.kaggle.com/jacksoncrow/wiki-articles-multimodal
The full dataset of 5638 articles (14.2 GB) is still uploading; will follow up with a link.

Filed another defect found while fine-tuning the image download: T236614

OlehOnyshchcak updated the task description.

Full dataset - https://drive.google.com/open?id=18i0D-N1J18UC1ebT9qbHZegKJQiKba5z
Will post a link to the full dataset on Kaggle.