Meeting 8 - Fri 17 June 2016 - 12:30 UTC
Closed, Resolved (Public)

Description

Date: 17 June 2016
Time: 12:30 UTC
Type: skype, fallback IRC (channel #gsoc-catimages)

Description: 8th meeting (Week 4) to discuss the project's progress with regard to the midterm evaluation and its outcome

Agenda:

Minutes of the Meeting:

  • Make a simple bot and readme - User:AbdealiJK/file-metadata and gist:a94fc0
  • Draft email to commons-l - User:AbdealiJK/file-metadata/Email
  • Upstream
    • dlib - Created the fix for setup.py dlib/136
    • matplotlib - Created PR to suggest installation step matplotlib/6575
    • skimage - Found an unusual bug in skimage where reading a file does not give the expected output - scikit-image/2154
  • Bug fixes -
    • Unicode encodings - There was an issue in decoding strings coming from C libraries. Pointed out by Zhuyifei (c755e7)
    • Dict issue - The dict did not handle keys that previously had a value and were then changed to None. Pointed out by Zhuyifei (e11dc9)
    • Large decompressed files - Pillow threw a warning for files that had too many pixels. We now warn and ignore this gracefully. Fixed in 19382d
    • zxing small images - ZXing has an issue with very small images because the first 3 pixel locations were hardcoded. We now skip small images for zxing. Fixed in 1c0de6, and fixed upstream too in zxing/607.
    • zxing unsupported image type - ZXing does not support CMYK files. In some cases, exiftool is unsure whether a file is RGB or CMYK. In those cases we just assume it is CMYK and convert it. Fixed in 7fc106
  • New features
  • Bulk tests - I got the bot running on the toolslab server (Zhuyifei wanted to test it there and I thought it'd be a good idea to get it running there). I also got it to run in bulk using toolslab (because Travis' time limit was getting annoying). Made the logs at ...logs/Category_Images_from_the_State_Library_of_Queensland and .../logs/Category_JPEG_files
    • we already have some very nice results here (few or no false positives); even non-portraits get detected quite well: https://commons.wikimedia.org/wiki/File:StateLibQld_1_211364_Neil_Cameron.jpg
    • some faces do not get recognized at all - we would like to improve this
    • e.g. train the bot with images of persons we know in advance will appear in a dataset (e.g. generals or politicians during wars, etc.)
    • for next week: implement haarcascade; there is also code for training, but it is unclear how that will proceed (let's see ;)
    • maybe also train the bot on the dataset itself, at least after humans have gone over it
  • T135835: we go for Docker (with the help of e.g. Vagrant and a VM like VirtualBox) and conda (mainly on Windows, to fulfill deps)
  • MVP:
    • Categorizing based on metadata actually needs the bot to write to Commons, but that is a bad idea during beta testing since it might cause chaos. Thus, for now, just print the proposed wiki page changes to the console and keep improving them until we are stable enough to mass-write
  • we have made first progress regarding beta testers: zhuyifei1999 and 99of9 (thank you very much for your participation! that's exciting!)
    • a bot script was made and runs on toollabs - needs to be documented (e.g. like https://wikitech.wikimedia.org/wiki/DrTrigonBot or better in userspace for now) to get more testers
    • have 2 different bot run modes:
      • auto: be conservative - no false positives, so as not to annoy Commons users with unreliable bot work (and give the maintainer a lot of work fixing things)
      • user-maintained: be more experimental - show ALL possible results (no matter how significant) and the user decides which ones are valid
    • brings me to the idea T138119: Use user-maintained bot run mode to gain stats and learn
  • the status of the video copyright project is very promising (even with a short timeout) and it is close to being able to name uploaded movies
    • we have to think about integrating it into this project (cross-language: Python and .NET)
    • after the MVP we will start to work together (hacking sessions, meetings/communication, etc.)
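
The two bot run modes and the print-only behaviour proposed for the MVP can be illustrated with a minimal sketch. This is not the actual bot code: the `Proposal` class, the function names, the `AUTO_MIN_CONFIDENCE` threshold, and the example page and category names are all hypothetical and only serve to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical cut-off for "auto" mode; the minutes do not specify a value.
AUTO_MIN_CONFIDENCE = 0.9


@dataclass
class Proposal:
    """A proposed category for a file, with the analyzer's confidence (0..1)."""
    page: str
    category: str
    confidence: float


def select_proposals(proposals, mode):
    """Filter proposals according to the bot run mode."""
    if mode == "auto":
        # Conservative: keep only high-confidence results to avoid false positives.
        return [p for p in proposals if p.confidence >= AUTO_MIN_CONFIDENCE]
    if mode == "user-maintained":
        # Experimental: show everything, most confident first; the user decides.
        return sorted(proposals, key=lambda p: p.confidence, reverse=True)
    raise ValueError(f"unknown run mode: {mode!r}")


def format_dry_run(proposals):
    """Render proposed edits as text lines instead of writing to the wiki."""
    return [
        f"[dry-run] {p.page}: would add [[Category:{p.category}]] "
        f"(confidence {p.confidence:.2f})"
        for p in proposals
    ]


if __name__ == "__main__":
    found = [
        Proposal("File:Example_A.jpg", "Example category A", 0.95),
        Proposal("File:Example_B.jpg", "Example category B", 0.40),
    ]
    for line in format_dry_run(select_proposals(found, "auto")):
        print(line)
```

Keeping the selection step separate from the output step would let the same proposals be printed during beta testing and written to Commons later, once the results are considered stable enough.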