Date: 22 July 2016
Time: 12:30 UTC
Type: Skype, fallback IRC (channel #gsoc-catimages)
Description: 13th meeting (Week 9) to discuss how to find a "sexy" number for MVP
- Last run log: https://commons.wikimedia.org/w/index.php?title=User:AbdealiJKTravis/logs/newimages&oldid=202088732 on NewImages
- Category:JPEG files (89.4301 %)
- Category:Human faces (10.1554 %)
Minutes of the Meeting:
- Discussed the currently achieved rate of categorization:
- Category:JPEG files (89.4301 %): Nice, but not to be included in the calculation since it is too easy. The real question is why not 100% of the files are categorized according to their MIME type. Errors in the file format, or in the file-metadata code?
- Category:Human faces (10.1554 %): Nice. Can be used, but must not be the only working categorization.
- Category:Unidentified people (9.9482 %): (belongs to the former one)
- Template:Created with Adobe Photoshop (6.8394 %): Templates count as leaf node categories since they are high quality: they can be adapted over time as needs change.
- Category:Graphics (6.8394 %): Has an error rate of about 1%, so we can reach 5% here. A good one too.
- Template:GPS EXIF (3.9378 %): The geo tagging categories are very nice, but the percentage is a bit low.
- Category:Black and white photographs (2.7979 %): Same.
- Category:PDF files (2.7979 %): Usually not that high.
- Category:Taken with Canon EOS 700D (2.5907 %): Ok.
- Template:Created with GIMP (2.3834 %): Template again.
- Category:SVG files (2.0725 %): Good one too. We should consider this one a bit more since the number of SVG files is expected to increase. We should consider adding https://commons.wikimedia.org/wiki/Category:SVG_created_with_..._templates if possible.
- The number of categories put per file is another important point for quality. Given that we have a lot of file type cats, we should look at these values after subtracting 1 from each entry:
- 1 (50.8808 %): These
- 2 (27.4611 %)
- 3 (12.4352 %)
- 4 (6.2176 %)
- 5 (2.2798 %)
- 6 (0.7254 %)
- In summary, a realistic goal is to have 3 cats for 5% of the files (excluding file type cats).
- We should have some more indicators and stats, e.g. an entry stating whether a category is a leaf cat or not.
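The "subtract 1 for each entry" adjustment above can be sketched in a few lines of Python. This is only an illustration, not the bot's code: the function names are hypothetical, and it assumes that every file carries exactly one file type cat, which the run log does not guarantee.

```python
# Share of files (in %) by number of categories assigned, taken from the list above.
cats_per_file = {1: 50.8808, 2: 27.4611, 3: 12.4352, 4: 6.2176, 5: 2.2798, 6: 0.7254}


def without_file_type_cat(distribution):
    """Shift every category count down by one.

    Assumption: each file has exactly one file type cat.
    """
    return {count - 1: share for count, share in distribution.items()}


def share_with_at_least(distribution, minimum):
    """Sum the shares of all files that have at least `minimum` categories."""
    return sum(share for count, share in distribution.items() if count >= minimum)


adjusted = without_file_type_cat(cats_per_file)
three_or_more = share_with_at_least(adjusted, 3)
print(f"{three_or_more:.4f} % of files have 3 or more non-file-type cats")
```

A per-file count of the actual non-file-type cats would be more reliable than this uniform shift, since some files may have no file type cat at all.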
- The bot needs to be run over representative samples of files, e.g. a whole week of NewImages
- number of uploads per day to Commons: ~11k
- number of files in test runs until now: 1k (roughly 10% of a day's upload activity)
- the bot must be run on at least 1 weekday (We) and 1 weekend day (Sa), on every 10th file, in order to get an overview
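Taking every 10th file from a day's uploads could look like the following minimal sketch. The file titles are made-up placeholders and `every_nth` is a hypothetical helper; how the bot actually fetches the NewImages list is not covered here.

```python
from itertools import islice


def every_nth(files, n=10):
    """Return every n-th item of an iterable, starting with the first one."""
    return list(islice(files, 0, None, n))


# Placeholder for one day of uploads (~11k files, as noted above).
uploads = [f"File:Example_{i}.jpg" for i in range(11000)]

sample = every_nth(uploads)
print(len(sample))
```

Because `islice` works on any iterable, the same helper can be applied to a generator of uploads without loading the full day's list first.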