This task is to confirm that the project MVP has been reached.
DRAFT MVP:
Background
---------
There are roughly 11k uploads to Wikimedia Commons per day [1],
and about 10%-20% are still uncategorised after 6 months, based on data
since February 2016 [2]. The day-to-day value varies widely, from 5% to 100%.
pywikibot-catfiles helps reduce categorisation workload for humans by automatically determining
categories for uploaded files based on metadata in the file and Computer Vision techniques.
MVP
---
pywikibot-catfiles MVP user requirements:
# will process files from any pywikibot 'page generator'
# has a non-write mode that logs its simulated actions to the wiki,
and writes a report about its effectiveness in adding categories
# has a manual mode that allows the user to accept or reject categorisation suggestions
# has an automatic mode that conservatively writes to the wiki, only adding categories it
is very confident are correct
# has documentation for installation suitable for non-technical people on Linux on Python 2
# has documentation regarding dependencies, esp. which platforms are not supported by each dependency
The bot-added categories are grouped by difficulty, easy vs hard, with hard being:
* human faces
* faces with glasses
* graphics
* barcodes
* ....
The bot-added categories are also grouped into functional groups (https://etherpad.wikimedia.org/p/catimages_buckets):
* file type
* geo
* content ("hard" computer vision; see list above)
> # we should move the etherpad to a wiki page on commons or labs
Additional pywikibot-catfiles MVP functional requirements:
* 100% file type categorisation, with 1% error rate
* 70% categorised correctly in one other category (not file type categories)
* 15% get highly relevant categories added (leaf, template, etc.), with 2% error rate for easy categories
* 8% categorised in every bucket
A 10% error rate is expected for the hard categories.
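As a rough illustration, the thresholds above could be checked against the counts in a report along these lines. This is only a sketch: the key names in `counts` are hypothetical (the real report format may differ), and the choice of denominators (all files for the shares, added categories for the error rates) is an assumption about how the requirements are meant to be read.

```python
# Sketch only: every key name below is hypothetical, not the bot's report format.

def meets_share(count, total, percent):
    """True if count is at least `percent` percent of total."""
    return total > 0 and count * 100 >= total * percent


def within_error(errors, added, percent):
    """True if errors are at most `percent` percent of the categories added."""
    return errors * 100 <= added * percent


def check_mvp(counts):
    """Map each functional requirement to True (met) or False (not met)."""
    total = counts["total_files"]
    return {
        # 100% file type categorisation, with 1% error rate
        "file_type_share": meets_share(counts["file_type_added"], total, 100),
        "file_type_errors": within_error(
            counts["file_type_errors"], counts["file_type_added"], 1),
        # 70% categorised correctly in one other category
        "other_category_share": meets_share(
            counts["other_category_correct"], total, 70),
        # 15% get highly relevant categories, 2% error rate for easy categories
        "highly_relevant_share": meets_share(
            counts["highly_relevant_added"], total, 15),
        "easy_errors": within_error(
            counts["easy_errors"], counts["easy_added"], 2),
        # 8% categorised in every bucket
        "every_bucket_share": meets_share(
            counts["every_bucket_added"], total, 8),
        # 10% error rate expected for the hard categories
        "hard_errors": within_error(
            counts["hard_errors"], counts["hard_added"], 10),
    }
```

Integer arithmetic is used for the comparisons so that a value sitting exactly on a threshold is not rejected because of floating-point rounding.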
Verification
------
To test the MVP, the tool will be run in automatic non-write mode on all files
uploaded over a recent 72hr period (three consecutive days) that have not been
categorised 24hrs after they were uploaded to Wikimedia Commons, with
report bins for each 24hr.
(Note: T141765 currently prevents using reverse order processing.)
> # Do you imply by this formulation that the bot has to run on files that were uploaded no longer than 24hrs ago, or would running on e.g. July be fine too? (Would that allow us e.g. to choose the 3 day slot appropriately such that we just meet the MVP? That might be considered cheating a bit...) Should we include a clause to run it on a slot within the last week only?
> # added "consecutive" and "bins"
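The selection and binning described above could be sketched as follows. This is not how the bot does it: the record layout `(title, uploaded, first_categorised)` and the function names are hypothetical, and in practice the upload and categorisation times would have to come from the wiki.

```python
# Sketch only: the record layout and function names are assumptions.
from datetime import datetime, timedelta

DAY = timedelta(hours=24)


def uncategorised_after_24h(uploaded, first_categorised):
    """True if the file still had no category 24hrs after its upload."""
    return first_categorised is None or first_categorised - uploaded > DAY


def bin_uploads(records, window_start, days=3):
    """Group qualifying file titles into consecutive 24hr bins.

    records: iterable of (title, uploaded, first_categorised) tuples, where
    first_categorised is None for files that were never categorised.
    """
    bins = [[] for _ in range(days)]
    for title, uploaded, first_categorised in records:
        index = (uploaded - window_start) // DAY  # whole days since the start
        if 0 <= index < days and uncategorised_after_24h(
                uploaded, first_categorised):
            bins[index].append(title)
    return bins
```

Each inner list then corresponds to one 24hr report bin of the 72hr window. Files uploaded outside the window, or categorised within 24hrs of upload, are left out.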
These reports will be analysed to check that they meet the functional requirements.
Example reports:
* https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newimages
* https://commons.wikimedia.org/wiki/Special:PrefixIndex/User:AbdealiJKTravis/logs/newimages/
See also
------
[1] https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm#uploader_activity_levels ("Breakdown of uploads")
[2] https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_in_use_in_galleries&subcatfrom=+20160227%0AMedia+needing+categories+as+of+27+February+2016#mw-subcategories