This task is to confirm that the project MVP has been reached.
DRAFT MVP:
Background
---------
There are around 11k uploads to Wikimedia Commons per day [1],
and about 10%-20% are still uncategorised after 6 months, based on data
since February 2016 [2]. For any single day, this value can vary from 5% to 100%.
pywikibot-catfiles helps reduce the categorisation workload for humans by automatically determining
categories for uploaded files, based on metadata in the file and computer vision techniques.
MVP
---
pywikibot-catfiles MVP user requirements:
# will process files from any pywikibot 'page generator'
# has a non-write mode that logs its simulated actions to the wiki,
and writes a report about its effectiveness in adding categories
# has a manual mode that allows the user to accept or reject categorisation suggestions
# has an automatic mode that conservatively writes to the wiki, only adding categories it
is very confident are correct
# has documentation for installation suitable for non-technical people on Linux on Python 2
# has documentation regarding dependencies, especially which platforms each dependency does not support
The bot-added categories are grouped by difficulty (easy vs hard), with hard being:
* human faces
* faces with glasses
* graphics
* barcodes
* ....
The bot-added categories are also grouped into functional groups (https://etherpad.wikimedia.org/p/catimages_buckets):
* file type
* geo
* content ("hard" computer vision; see list above)
> etherpad to be moved to a wiki page on Commons
Additional pywikibot-catfiles MVP functional requirements:
* 100% file type categorisation, with 1% error rate
* 70% categorised correctly in one other category (not file type categories)
* 15% get highly relevant categories added (leaf, template, etc.), with 2% error rate for easy categories
* 8% categorised in every bucket
A 10% error rate is expected for the hard categories.
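As a rough illustration, the functional requirements above could be checked against the numbers in a report with a small helper like the one below. This is only a sketch: the metric names and the `check_mvp` function are hypothetical and not part of pywikibot-catfiles. It also assumes that each coverage percentage is a minimum, that each error rate is a maximum, and that all values are expressed as fractions between 0 and 1.

```python
# Hypothetical metric names; thresholds transcribed from the MVP requirements.
# "min" entries are treated as coverage floors, "max" entries as error ceilings
# (this reading of the requirements is an assumption).
MVP_THRESHOLDS = {
    "file_type_coverage": ("min", 1.00),
    "file_type_error": ("max", 0.01),
    "other_category_coverage": ("min", 0.70),
    "relevant_category_coverage": ("min", 0.15),
    "easy_error": ("max", 0.02),
    "every_bucket_coverage": ("min", 0.08),
    "hard_error": ("max", 0.10),
}


def check_mvp(measured):
    """Return a sorted list of (metric, value, kind, bound) for failed checks."""
    failures = []
    for metric, (kind, bound) in sorted(MVP_THRESHOLDS.items()):
        value = measured[metric]
        if kind == "min":
            ok = value >= bound
        else:
            ok = value <= bound
        if not ok:
            failures.append((metric, value, kind, bound))
    return failures
```

An empty list from `check_mvp` would mean that every threshold was met. Any entries name the metric that missed, its measured value, and the bound it was compared against.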
Verification
------
To test the MVP, the tool will be run in automatic non-write mode on all files
uploaded over a recent 72hr period (three consecutive days) that have not been
categorised 24hrs after they were uploaded to Wikimedia Commons, with
a separate report bin for each 24hr period.
(Note: T141765 currently prevents using reverse order processing.)
These reports will be analysed to check that they meet the functional requirements.
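The selection and binning described above might look roughly like the following sketch. It deliberately uses plain Python rather than any pywikibot API. The `bin_uncategorised` function and the `(title, uploaded_at, first_categorised_at)` tuple layout are assumptions made for illustration, and in practice the timestamps would have to come from the wiki.

```python
from datetime import datetime, timedelta


def bin_uncategorised(uploads, window_start, days=3, grace=timedelta(hours=24)):
    """Group uploads still uncategorised after `grace` into 24-hour bins.

    `uploads` is an iterable of (title, uploaded_at, first_categorised_at)
    tuples; first_categorised_at is None for files never categorised.
    """
    bins = [[] for _ in range(days)]
    for title, uploaded_at, first_categorised_at in uploads:
        # Whole days elapsed between the window start and the upload.
        index = int((uploaded_at - window_start).total_seconds() // 86400)
        if not 0 <= index < days:
            continue  # uploaded outside the test window
        still_uncategorised = (
            first_categorised_at is None
            or first_categorised_at > uploaded_at + grace
        )
        if still_uncategorised:
            bins[index].append(title)
    return bins
```

With the default `days=3`, the result has one list of file titles per 24hr bin of the 72hr window. Files categorised within 24hrs of upload are left out.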
Example reports:
* https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newimages
* https://commons.wikimedia.org/wiki/Special:PrefixIndex/User:AbdealiJKTravis/logs/newimages/
See also
------
[1] https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm#uploader_activity_levels ("Breakdown of uploads")
[2] https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_in_use_in_galleries&subcatfrom=+20160227%0AMedia+needing+categories+as+of+27+February+2016#mw-subcategories