This task is to confirm that the project MVP has been reached.
DRAFT MVP:
Background
----------
There are around 11k uploads to Wikimedia Commons per day [1]
and about 10%-20% are still uncategorised after 6 months, based on data
since February 2016 [2]. On a daily basis this value can vary anywhere
from 5% to 100%.
> # moved the references down to "See also"; for the interested readers only
> # added "can vary from" - this is NOT meant as an apology but as background, so that more interested readers can understand apparent discrepancies
pywikibot-catfiles helps reduce the categorisation workload for humans by automatically
determining categories for uploaded files based on metadata in the file and computer vision techniques.
MVP
---
pywikibot-catfiles MVP user requirements:
# will process files from any pywikibot 'page generator'
# has a non-write mode that logs its simulated actions to the wiki,
and writes a report about its effectiveness in adding categories
# has a manual mode that allows the user to accept or reject categorisation suggestions
# has an automatic mode that conservatively writes to the wiki, only adding categories it
is very confident are correct
> changed to numbering in order to make tracking and discussion simpler and to give a rough sense of priority
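The four requirements above can be sketched as one processing loop over a page generator. This is a minimal sketch, not the actual pywikibot-catfiles API: the names `process_files` and `suggest_categories`, the confidence threshold, and the returned report format are all hypothetical.

```python
# Minimal sketch of the MVP modes; all names here are hypothetical and
# not the actual pywikibot-catfiles API. A pywikibot page generator is
# simply an iterable of pages, so any iterable works in this sketch.

def suggest_categories(page):
    """Placeholder for the real metadata/computer-vision categoriser."""
    return [("Category:Uploaded with pywikibot", 0.99)]

def process_files(pages, mode="non-write", threshold=0.95, ask=None):
    """Process files from any page generator in one of three modes.

    mode: "non-write" records simulated actions only,
          "manual" asks the user (via ask()) per suggestion,
          "automatic" keeps only suggestions at or above the
          confidence threshold, which would then be written.
    """
    report = []
    for page in pages:
        for category, confidence in suggest_categories(page):
            if mode == "non-write":
                report.append((page, category, "simulated"))
            elif mode == "manual" and ask and ask(page, category):
                report.append((page, category, "accepted"))
            elif mode == "automatic" and confidence >= threshold:
                report.append((page, category, "written"))
    return report
```

In non-write mode the report would additionally be logged to a page on the wiki, which is what the effectiveness analysis below would consume.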
The bot-added categories are grouped by difficulty, easy vs hard, with hard being:
* human faces
* faces with glasses
* graphics
* barcodes
* ....
The bot-added categories are also grouped into functional groups (https://etherpad.wikimedia.org/p/catimages_buckets):
* file type
* geo
* content ("hard" computer vision; see list above)
> # we should move the etherpad to a wiki page on commons or labs
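The functional grouping above amounts to a lookup from category to bucket. A small illustrative sketch follows; the actual bucket definitions live in the etherpad, and the category names in this mapping are assumptions for illustration only.

```python
# Illustrative grouping of suggested categories into the functional
# buckets listed above. The real bucket definitions are in the
# etherpad; this mapping is an assumption for illustration.

BUCKETS = {
    "file type": ["Category:JPEG files", "Category:PNG files",
                  "Category:SVG files"],
    "geo": ["Category:Photographs by country"],
    "content": ["Category:Human faces", "Category:Barcodes",
                "Category:Graphics"],
}

def bucket_of(category):
    """Return the functional bucket a category belongs to, or None."""
    for bucket, members in BUCKETS.items():
        if category in members:
            return bucket
    return None
```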
Additional pywikibot-catfiles MVP functional requirements:
* 100% file type categorisation, with a 1% error rate
* 70% categorised correctly in at least one other category (not file type categories)
* 15% get highly relevant categories added (leaf, template, etc.), with a 2% error rate for easy categories
* 8% categorised in every bucket
A 10% error rate is expected for the hard categories.
> I don't like or understand the 0% error rate here - can you elaborate? (mathematically speaking this is a problematic number, since it literally means the bot has to perform that task perfectly)
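The thresholds above can be checked mechanically from a run's report. The following is a sketch under the assumption that the report yields, per requirement, a count of files covered and a count of errors; the requirement names and the function are hypothetical.

```python
# Sketch of checking the MVP functional requirements against report
# counts. The (minimum coverage, maximum error rate) pairs come from
# the list above; the requirement names are hypothetical.

REQUIREMENTS = {
    "file_type": (1.00, 0.01),
    "other_category": (0.70, None),   # no error bound stated
    "highly_relevant": (0.15, 0.02),  # 2% bound for easy categories
    "every_bucket": (0.08, None),
}

def meets_requirements(total, counts, errors):
    """counts/errors map a requirement name to the number of files
    categorised (and miscategorised) under that requirement."""
    for name, (min_rate, max_error) in REQUIREMENTS.items():
        covered = counts.get(name, 0)
        if covered / total < min_rate:
            return False
        if max_error is not None and covered:
            if errors.get(name, 0) / covered > max_error:
                return False
    return True
```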
Verification
------------
To test the MVP, the tool will be run in automatic non-write mode on all files
uploaded over a 72hr period (three consecutive days) that have not been
categorised 24hrs after they were uploaded to Wikimedia Commons, with
a report bin for each 24hr period.
(Note: T141765 currently prevents using reverse order processing.)
> # Do you imply by this formulation that the bot has to run on files that have been uploaded no longer than 24hrs ago, or would running it on e.g. July be fine too? (would that allow us e.g. to choose the 3-day slot appropriately such that we just meet the MVP? since that might be considered cheating a bit...) Should we include a clause to run it on a slot within the last week only?
> # added "consecutive" and "bins"
These reports will be analysed to check that they meet the functional requirements.
Example reports:
* https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newimages
* https://commons.wikimedia.org/wiki/Special:PrefixIndex/User:AbdealiJKTravis/logs/newimages/
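The selection and binning rule above (72hr window, files still uncategorised 24hrs after upload, one bin per day) can be sketched as follows. Upload and categorisation timestamps are assumed to be obtainable, e.g. from the upload log; the function and field names are hypothetical.

```python
from datetime import datetime, timedelta

# Sketch of selecting files for the MVP test run and grouping them
# into three 24hr report bins. Timestamps are assumed to be available
# from the upload log; the names here are hypothetical.

def bin_files(files, window_start):
    """files: iterable of (title, uploaded_at, categorised_at or None).

    Keeps files uploaded inside the 72hr window that were still
    uncategorised 24hrs after upload, grouped into per-day bins.
    """
    bins = {0: [], 1: [], 2: []}
    for title, uploaded, categorised in files:
        if not (window_start <= uploaded < window_start + timedelta(hours=72)):
            continue  # outside the three-day window
        deadline = uploaded + timedelta(hours=24)
        if categorised is not None and categorised <= deadline:
            continue  # categorised in time: not part of the test set
        day = (uploaded - window_start) // timedelta(hours=24)
        bins[day].append(title)
    return bins
```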
See also
--------
[1] https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm#uploader_activity_levels ("Breakdown of uploads")
[2] https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_in_use_in_galleries&subcatfrom=+20160227%0AMedia+needing+categories+as+of+27+February+2016#mw-subcategories
Further Benefits
----------------
List of positive side-effects (not needed by MVP):
* a Docker image to ease future development of the code and enable deployment into PAWS
* Support of big upload campaigns/sets possible, see e.g.: https://commons.wikimedia.org/wiki/User:DrTrigon/logs/eth-bib
> # Added this part
> # Do you want to add some words about py3, documentation, or agreements with OSM?