
Assess pywikibot-catfiles MVP
Closed, Resolved (Public)

Description

This task is to confirm that the project MVP has been reached.

DRAFT MVP:

Background

There are roughly 11k uploads to Wikimedia Commons per day [1],
and about 10%-20% remain uncategorised after 6 months, based on data
since February 2016 [2]. The uncategorised share can vary from 5% to 100%
on a daily basis.

pywikibot-catfiles helps reduce categorisation workload for humans by automatically determining
categories for uploaded files based on metadata in the file and Computer Vision techniques.

MVP

pywikibot-catfiles MVP user requirements:

  1. will process files from any pywikibot 'page generator'
  2. has a non-write mode that logs its simulated actions to the wiki, and writes a report about its effectiveness in adding categories
  3. has a manual mode that allows the user to accept or reject categorisation suggestions
  4. has an automatic mode that conservatively writes to the wiki, only adding categories it is very confident are correct
  5. has documentation for installation suitable for non-technical people on Linux on Python 2
  6. has documentation regarding dependencies, esp. which platforms are not supported by each
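
Requirements 2 to 4 describe three ways of acting on the same categorisation suggestions. A minimal Python 3 sketch of that dispatch, assuming each suggestion carries a confidence score. The names `Mode`, `Suggestion`, `decide` and the 0.9 cut-off are hypothetical illustrations, not taken from pywikibot-catfiles:

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SIMULATE = "simulate"    # requirement 2: log only, never write categories
    MANUAL = "manual"        # requirement 3: a human accepts or rejects
    AUTOMATIC = "automatic"  # requirement 4: write only when very confident


@dataclass
class Suggestion:
    category: str
    confidence: float  # assumed to lie in [0, 1]


# Hypothetical cut-off for "very confident"; the task does not specify one.
AUTO_THRESHOLD = 0.9


def decide(mode, suggestion, ask=None):
    """Return 'log', 'write' or 'skip' for one suggested category."""
    if mode is Mode.SIMULATE:
        return "log"
    if mode is Mode.MANUAL:
        # `ask` is a callback standing in for the interactive prompt.
        return "write" if ask is not None and ask(suggestion) else "skip"
    return "write" if suggestion.confidence >= AUTO_THRESHOLD else "skip"
```

Keeping the decision separate from the wiki write would let the non-write mode exercise the same code path that the automatic mode uses.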

The bot-added categories are grouped by difficulty, easy vs hard, with hard being:

  • Human Faces
  • Faces with Glasses
  • Graphics
  • Barcodes
  • IT8 Calibration Strip
  • Stereo Cards
  • ...

The bot-added categories are also grouped into functional groups (https://etherpad.wikimedia.org/p/catimages_buckets):

  • file type
  • geo
  • content ("hard" computer vision; see list above)

The etherpad is to be moved to a wiki page on Commons.

Additional pywikibot-catfiles MVP functional requirements:

  • 100% file type categorisation, with 1% error rate
  • 70% categorised correctly in one other category (not file type categories)
  • 15% get highly relevant categories added (leaf, template, etc.), with 2% error rate for easy categories
  • 8% categorised in every bucket

10% error rate is expected for the hard categories.
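
One way to read the targets above is as minimum coverage fractions plus maximum error rates. A rough sketch of checking a report against them, assuming the report can be reduced to plain counts. The function `check_report`, the dictionary keys, and the shape of the inputs are all hypothetical; the actual report format is not described in this task:

```python
# Coverage targets from the list above, as fractions of processed files.
COVERAGE_TARGETS = {
    "file_type": 1.00,
    "other_category": 0.70,
    "highly_relevant": 0.15,
    "every_bucket": 0.08,
}

# Maximum tolerated error rates, as fractions of categories added.
ERROR_LIMITS = {
    "file_type": 0.01,
    "easy": 0.02,
    "hard": 0.10,
}


def check_report(total_files, covered, added, wrong):
    """Return a list of failure messages; an empty list means the report passes.

    covered: number of files meeting each coverage target
    added / wrong: categories added, and categories judged wrong, per group
    """
    if total_files <= 0:
        return ["no files processed"]
    failures = []
    for name, target in COVERAGE_TARGETS.items():
        rate = covered.get(name, 0) / total_files
        if rate < target:
            failures.append(
                "%s coverage %.1f%% is below %.0f%%" % (name, rate * 100, target * 100)
            )
    for name, limit in ERROR_LIMITS.items():
        count = added.get(name, 0)
        if count and wrong.get(name, 0) / count > limit:
            rate = wrong[name] / count
            failures.append(
                "%s error rate %.1f%% exceeds %.0f%%" % (name, rate * 100, limit * 100)
            )
    return failures
```

Judging which added categories are wrong still needs a human review of a sample; the sketch only covers the arithmetic once those counts exist.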

Verification

To test the MVP, the tool will be run in automatic non-write mode on all files
uploaded over a recent 72hr period (three consecutive days) that have not been
categorised 24hrs after they were uploaded to Wikimedia Commons, with
report bins for each 24hr.
(Note: T141765 currently prevents using reverse order processing.)
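
The selection described above (files from a three-day window that are still uncategorised 24 hours after upload, split into one bin per day) could be sketched as follows. The function `bin_uploads` and the `(title, uploaded_at, categorised_at)` tuple layout are assumptions for illustration; how the timestamps are obtained from Commons is left out:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_uploads(uploads, window_start, now, days=3):
    """Group uncategorised uploads into 24-hour bins.

    uploads: iterable of (title, uploaded_at, categorised_at), where
    categorised_at is None if the file has never been categorised.
    Returns {bin_index: [titles]}, with bin 0 being the first day.
    """
    grace = timedelta(hours=24)
    window_end = window_start + timedelta(days=days)
    bins = defaultdict(list)
    for title, uploaded_at, categorised_at in uploads:
        if not (window_start <= uploaded_at < window_end):
            continue  # outside the 72-hour test window
        deadline = uploaded_at + grace
        if deadline > now:
            continue  # less than 24 hours old, cannot be judged yet
        if categorised_at is not None and categorised_at <= deadline:
            continue  # a human categorised it within 24 hours
        bins[(uploaded_at - window_start).days].append(title)
    return dict(bins)


start = datetime(2016, 8, 1)
sample = [
    ("A.jpg", datetime(2016, 8, 1, 10), None),
    ("B.jpg", datetime(2016, 8, 1, 12), datetime(2016, 8, 1, 13)),
    ("C.jpg", datetime(2016, 8, 2, 1), datetime(2016, 8, 4)),
    ("D.jpg", datetime(2016, 8, 3, 23), None),
    ("E.jpg", datetime(2016, 8, 4, 0), None),
]
report_bins = bin_uploads(sample, start, now=datetime(2016, 8, 5))
```

In this made-up sample, B.jpg is dropped because it was categorised within an hour, and E.jpg falls just outside the window.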

These reports will be analysed to check that they meet the functional requirements.

Example reports:

See also

[1] https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm#uploader_activity_levels ("Breakdown of uploads")
[2] https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_in_use_in_galleries&subcatfrom=+20160227%0AMedia+needing+categories+as+of+27+February+2016#mw-subcategories

Event Timeline

DrTrigon updated the task description.
jayvdb updated the task description.