This task is to confirm that the project MVP has been reached.
DRAFT MVP:
Background
There are roughly 11k uploads to Wikimedia Commons per day [1],
and about 10%-20% of them are still uncategorised after 6 months, based on data
since February 2016 [2]. For any single day, this share can range from 5% to
100%.
pywikibot-catfiles helps reduce categorisation workload for humans by automatically determining
categories for uploaded files based on metadata in the file and Computer Vision techniques.
MVP
pywikibot-catfiles MVP user requirements:
- will process files from any pywikibot 'page generator'
- has a non-write mode that logs its simulated actions to the wiki, and writes a report about its effectiveness in adding categories
- has a manual mode that allows the user to accept or reject categorisation suggestions
- has an automatic mode that conservatively writes to the wiki, only adding categories it is very confident are correct
- has installation documentation suitable for non-technical people, covering Linux with Python 2
- has documentation regarding dependencies, esp. which platforms are not supported by each
The bot-added categories are grouped by difficulty (easy vs. hard), with the hard ones being:
- Human Faces
- Faces with Glasses
- Graphics
- Barcodes
- IT8 Calibration Strip
- Stereo Cards
- ...
The bot-added categories are also grouped into functional groups (https://etherpad.wikimedia.org/p/catimages_buckets):
- file type
- geo
- content ("hard" computer vision; see list above)
The etherpad is to be moved to a wiki page on Commons.
Additional pywikibot-catfiles MVP functional requirements:
- 100% file type categorisation, with 1% error rate
- 70% categorised correctly in one other category (not file type categories)
- 15% get highly relevant categories added (leaf, template, etc.), with 2% error rate for easy categories
- 8% categorised in every bucket
A 10% error rate is expected for the hard categories.
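
The thresholds above could be checked mechanically against a report. The following is only a minimal sketch: the function, the dictionary keys and the way the error rate is measured (incorrect categories divided by categories added in that group) are assumptions made for illustration, not part of pywikibot-catfiles.

```python
# Thresholds copied from the functional requirements above.
# All names and keys below are hypothetical, for illustration only.
MIN_COVERAGE = {
    "file_type": 1.00,        # 100% file type categorisation
    "other_category": 0.70,   # 70% correct in one other category
    "highly_relevant": 0.15,  # 15% get highly relevant categories
    "every_bucket": 0.08,     # 8% categorised in every bucket
}
MAX_ERROR_RATE = {
    "file_type": 0.01,  # 1% error rate for file type categories
    "easy": 0.02,       # 2% error rate for easy categories
    "hard": 0.10,       # 10% error rate for hard categories
}


def share(part, whole):
    """Return part/whole as a float, or 0.0 when whole is zero."""
    return float(part) / whole if whole else 0.0


def check_report(total_files, covered, added, wrong):
    """Compare one report's counts against the MVP thresholds.

    covered: number of files satisfying each coverage requirement
    added:   number of categories added per group
    wrong:   number of incorrect categories per group (assumed to be
             known from manual review of the report)
    Returns a dict mapping each requirement to True (met) or False.
    """
    results = {}
    for name, minimum in MIN_COVERAGE.items():
        actual = share(covered.get(name, 0), total_files)
        results["coverage:" + name] = actual >= minimum
    for name, maximum in MAX_ERROR_RATE.items():
        actual = share(wrong.get(name, 0), added.get(name, 0))
        results["errors:" + name] = actual <= maximum
    return results
```

A report bin that meets every requirement would yield a dictionary whose values are all True. Any False entry names the requirement that was missed.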
Verification
To test the MVP, the tool will be run in automatic non-write mode on all files
uploaded over a recent 72hr period (three consecutive days) that have not been
categorised 24hrs after they were uploaded to Wikimedia Commons, with
report bins for each 24hr.
(Note: T141765 currently prevents using reverse order processing.)
These reports will be analysed to check that they meet the functional requirements.
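
As an illustration of the 24hr report bins, the sketch below groups uploads from a 72hr window into three bins. It assumes each upload is available as a (title, upload time, still-uncategorised-after-24hrs) tuple. That record shape and the function name are hypothetical, and fetching the data from the wiki is left out.

```python
from datetime import datetime, timedelta

BIN_LENGTH = timedelta(hours=24)
WINDOW_BINS = 3  # 72hr window = three consecutive 24hr bins


def bin_uploads(uploads, window_start):
    """Group file titles into 24hr bins counted from window_start.

    uploads is an iterable of (title, uploaded_at, uncategorised_after_24h)
    tuples; this shape is an assumption made for the example.
    Files outside the window, or already categorised after 24hrs,
    are skipped.
    """
    bins = [[] for _ in range(WINDOW_BINS)]
    for title, uploaded_at, uncategorised_after_24h in uploads:
        if not uncategorised_after_24h:
            continue
        offset = uploaded_at - window_start
        # Floor division, so uploads before the window get a negative index.
        index = int(offset.total_seconds() // BIN_LENGTH.total_seconds())
        if 0 <= index < WINDOW_BINS:
            bins[index].append(title)
    return bins
```

Each of the three returned lists would then feed one report.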
Example reports:
- https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newimages
- https://commons.wikimedia.org/wiki/Special:PrefixIndex/User:AbdealiJKTravis/logs/newimages/
See also
[1] https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm#uploader_activity_levels ("Breakdown of uploads")
[2] https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_in_use_in_galleries&subcatfrom=+20160227%0AMedia+needing+categories+as+of+27+February+2016#mw-subcategories