Meeting 13 - Fri 22 July 2016 - 12:30 UTC
Closed, Resolved


Date: 22 July 2016
Time: 12:30 UTC
Type: Skype, fallback IRC (channel #gsoc-catimages)

Description: 13th meeting (Week 9) to discuss how to find a "sexy" number for MVP


Minutes of the Meeting:

  • Discussed the categorization rates achieved so far:
    • Category:JPEG files (89.4301 %): Nice, but not to be included in the calculation since it is too easy. The real question is why not 100% of the files are categorized according to their MIME type. Errors in the file format or in the file-metadata code?
    • Category:Human faces (10.1554 %): Nice. Can be used, but must not be the only working categorization.
    • Category:Unidentified people (9.9482 %): (belongs to the former one)
    • Template:Created with Adobe Photoshop (6.8394 %): Templates count as leaf node categories since they are high quality: they can be adapted over time to changing needs.
    • Category:Graphics (6.8394 %): Has an error rate of about 1%, so we can reach 5% here. A good one too.
    • Template:GPS EXIF (3.9378 %): The geo tagging categories are very nice, but the percentage is a bit low.
    • Category:Black and white photographs (2.7979 %): Same.
    • Category:PDF files (2.7979 %): Usually not that high.
    • Category:Taken with Canon EOS 700D (2.5907 %): Ok.
    • Template:Created with GIMP (2.3834 %): Template again.
    • Category:SVG files (2.0725 %): Good one too. We should consider this one a bit more since the number of SVG files is expected to increase. We should consider putting if possible.
  • The number of categories put per file is another important indicator of quality. Given that we have a lot of file type cats, we should read these values after subtracting 1 from each entry:
    • 1 (50.8808 %): These
    • 2 (27.4611 %)
    • 3 (12.4352 %)
    • 4 (6.2176 %)
    • 5 (2.2798 %)
    • 6 (0.7254 %)
    • In summary, a realistic goal is to have 3 cats for 5% of the files (excluding file type cats).
  • We should have some more indicators and stats, e.g. an entry stating whether a category is a leaf cat or not.
  • The bot needs to be run over representative samples of files, e.g. a whole week of NewImages
    • number of uploads per day to commons: ~11k
    • number of files in test runs until now: 1k (roughly 10% of a day's upload activity)
    • the bot must be run on at least 1 weekday (Wednesday) and 1 weekend day (Saturday), on every 10th file, in order to get an overview
  • ...
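
The "subtract 1" reading of the cats-per-file figures and the size of the proposed every-10th-file sample can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the bot: the function names are made up for this example, and it assumes that every file carries exactly one file type cat and that a day has a flat 11k uploads.

```python
# Share of files (in %) by number of categories put per file,
# copied from the minutes above.
CATS_PER_FILE = {1: 50.8808, 2: 27.4611, 3: 12.4352,
                 4: 6.2176, 5: 2.2798, 6: 0.7254}


def without_file_type_cat(distribution):
    """Shift every count down by 1.

    Assumption: each file has exactly one file type cat.
    """
    return {count - 1: share for count, share in distribution.items()}


def share_with_at_least(distribution, min_cats):
    """Sum the shares of all entries with at least `min_cats` cats."""
    return sum(share for count, share in distribution.items()
               if count >= min_cats)


def every_nth(items, n=10):
    """Return every n-th item, starting with the first one."""
    return list(items)[::n]


if __name__ == "__main__":
    adjusted = without_file_type_cat(CATS_PER_FILE)
    for count in sorted(adjusted):
        print(f"{count} cats: {adjusted[count]:.4f} %")
    print(f">= 3 cats: {share_with_at_least(adjusted, 3):.4f} %")

    # Assumed flat ~11k uploads per day, every 10th file, on 2 days.
    sample_size = len(every_nth(range(11_000))) * 2
    print(f"files in the sample: {sample_size}")
```

Running the script prints the shifted distribution, the share of files with 3 or more cats once the file type cat is excluded, and the number of files the two sampled days would cover.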