Weekly reports of progress made for the GSoC project T129611.
== Community Bonding (April 22 - May 22)
- Request a task to create project : T133761 ( #Pywikibot-catimages )
- Create Community bonding evaluation subtask: T133692
- Created Weekly Reports Subtask and Community report task
- Created github repo for `file-metadata` - http://github.com/AbdealiJK/file-metadata
- Have set up CI for the github repo
- Made dev pypi package for the github repo https://pypi.python.org/pypi/file-metadata
- I have read through [Life_of_a_successful_project](https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_project) completely.
- Created a [ToolsLab account](https://wikitech.wikimedia.org/wiki/User_talk:AbdealiJK#Welcome_to_Tool_Labs)
- **Get involved in commons community**:
- Create User page in commons [User:AbdealiJK](https://commons.wikimedia.org/wiki/User:AbdealiJK) and meta.wikimedia [User:AbdealiJK](https://meta.wikimedia.org/wiki/User:AbdealiJK)
- Got some (500+) edits on commons [Special:Contributions/AbdealiJK](https://commons.wikimedia.org/wiki/Special:Contributions/AbdealiJK) (Reached 1000+)
- Sent email to commons, pywikibot, wikimedia lists - https://lists.wikimedia.org/pipermail/pywikibot/2016-May/009452.html
- **Get involved in pywikibot community**:
- Went through and gave input on bug reports related to [isort](https://phabricator.wikimedia.org/T132122), [testing](https://phabricator.wikimedia.org/T115313), [html_comparator](https://phabricator.wikimedia.org/T134341), [ToolsLab](https://phabricator.wikimedia.org/T134232), [weblinkchecker](https://phabricator.wikimedia.org/T124287), and many more, which can be seen in my profile [Phab: AbdealiJK](https://phabricator.wikimedia.org/p/AbdealiJK/). Also filed bugs based on my experience, like [-random](https://phabricator.wikimedia.org/T134720), [toolslab speed](https://phabricator.wikimedia.org/T134232), [proofreadpage test issue](https://phabricator.wikimedia.org/T129965), etc.
- [TODO] Make bot to verify wikimedia logos [T134644](https://phabricator.wikimedia.org/T134644)
- **Get involved with the communities of 3rd party dependencies of catimages**
- Hunt why [music21's travis](https://travis-ci.org/cuthbertLab/music21) fails - The master builds successfully, only dev branches fail, which is fine
- Hunt why [bob's travis](https://travis-ci.org/idiap/bob) fails - Created [idiap/bob#221](https://github.com/idiap/bob/issues/221). It is a non-deterministic issue with Travis and their dependencies, not a code problem, and pretty tough to solve myself.
- Check if python 3.3 is supported on cyvlfeat and add to their travis if it is - Created [menpo/cyvlfeat/12](https://github.com/menpo/cyvlfeat/issues/12); cyvlfeat does support 3.3 but their CI utils don't.
- Add python3 support to [yaafe and its travis](https://travis-ci.org/mckelvin/Yaafe) - Created [Yaafe/Yaafe/21](https://github.com/Yaafe/Yaafe/pull/21)
- Add 3.3 support for [matplotlib's travis](https://travis-ci.org/matplotlib/matplotlib) - This was removed because matplotlib decided to drop support for py3.3 and py2.6 in [254e16925](https://github.com/matplotlib/matplotlib/commit/254e16925644e114cb06ceaf9085196a6de0545d); there are no plans to add it back.
- **Understand catimages better**
- Create list of things done by catimages - https://etherpad.wikimedia.org/p/Zl7V7KuK7J
- Get the catimages script working, at least for some branches like JPEG, PNG, etc.
- Identify binary files, python packages and other dependencies of catimages. Also, identify whether they have a CI system and make pull requests for Travis if they do not have it. [List is here](https://phabricator.wikimedia.org/T129611)
- Compare ImageAnnotator gadget and rillke's JS template FileContentsByBot [page](https://commons.wikimedia.org/wiki/User:AbdealiJK/Comparison_AnnotationTool_FileContentsByBot)
- Decide on complete project plan with mentors (Decided in meeting 3)
- Create Subtasks for the Project based on above project plan (made in the file-metadata github repo)
- Published report (doing this every 2-3 days)
==== Meetings
(Minutes and Agenda are mentioned in the task related to the Meeting)
- Sun 1 May 2016 - 13:00 UTC : T133763, E172
- Sat 7 May 2016 - 12:30 UTC: T134121, E173
- Fri 13 May 2016 - 12:30 UTC: T134656, google calendar (E178)
- Fri 20 May 2016 - 11:30 UTC: T135230, google calendar
- Mon 23 May 2016 - {T135836}
- Week 1: Fri 27 May 2016 - 12:30 UTC: T135834, google calendar
- Week 2: Fri 3 June 2016 - 12:30 UTC: T136409, google calendar
- Week 3: Fri 10 June 2016 - 12:30 UTC: T136934, google calendar
- {T135835}
- Week 4: Fri 17 June 2016 - 12:30 UTC: T137557, google calendar
- Take the final decision about {T135835} (DrTrigon has to think about that and Docker)
- Summary: we go for Docker (with the help of e.g. Vagrant and a VM like VirtualBox) and conda (mainly on Windows, to satisfy dependencies)
- Week 5: Fri 24 June 2016 - 12:30 UTC: T138121, google calendar
== Week 1 (May 23 - May 29)
- **Dependencies on Travis**: I tested a lot of dependencies on travis and got them installed there. I found that dlib would be easier to install with conda because the `pkgconfig.h` wasn't found. Plus, conda has binaries for scipy making it much faster than installing with pip. (Even cyvlfeat is easier with conda...)
- **Framework**: I made an OOP framework similar to how catimages used to work: there is a class for each file type, plus a generic GenericFile class which everything inherits from. This is split into multiple files to keep it modular and easy to read. [45eae3](https://github.com/AbdealiJK/file-metadata/commit/45eae382510b66fecab777e744e98b2c8dd528e1)
- **Exif data**: Read exif data using exiftool. Using json to parse data. [b51228](https://github.com/AbdealiJK/file-metadata/commit/b5122872db3acb4c406945fa3daf1595b4c7cd84)
- **Mimetype**: Use magic for mimetype analysis with backups [1102cf](https://github.com/AbdealiJK/file-metadata/commit/1102cfb8a749661ce8a2be9ebd551b64995114a2)
- {T135836}
- **music21 and ffprobe**: Implemented ffprobe and avprobe support. Now, audio files are completely handled - Duration, rate, streams info, etc. are available.
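The framework and mimetype items above can be illustrated with a minimal sketch. It is not the code from the linked commits: the `create` factory, the `mime_prefixes` attribute and the `guess_mimetype` helper are hypothetical names, and only the extension-based backup path is shown (using the standard-library `mimetypes` module), not the magic-based detection.

```python
import mimetypes


def guess_mimetype(filename):
    """Guess a MIME type from the file extension (backup path only)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or 'application/octet-stream'


class GenericFile(object):
    """Base class that every file type inherits from."""

    # MIME prefixes handled by a subclass (hypothetical attribute).
    mime_prefixes = ()

    def __init__(self, filename):
        self.filename = filename
        self.mimetype = guess_mimetype(filename)

    @classmethod
    def create(cls, filename):
        """Return an instance of the first subclass matching the MIME type."""
        mime = guess_mimetype(filename)
        for subclass in cls.__subclasses__():
            if mime.startswith(subclass.mime_prefixes):
                return subclass(filename)
        return cls(filename)

    def analyze(self):
        """Analysis shared by all file types."""
        return {'mimetype': self.mimetype}


class ImageFile(GenericFile):
    mime_prefixes = ('image/',)

    def analyze(self):
        data = super(ImageFile, self).analyze()
        data['kind'] = 'image'
        return data


class AudioFile(GenericFile):
    mime_prefixes = ('audio/',)

    def analyze(self):
        data = super(AudioFile, self).analyze()
        data['kind'] = 'audio'
        return data
```

With this layout, `GenericFile.create('photo.jpg')` would hand back an `ImageFile`, while a file with an unknown type falls through to `GenericFile` itself.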
== Week 2 (May 30 - June 5)
- **Color Info** - Used colormath and pycolorname to get the average color of the given image. [e3f1bb](https://github.com/AbdealiJK/file-metadata/commit/e3f1bbef799d588952717faf4b970d45608ac9dc)
- **Facial landmark** - Got dlib working but used conda to install it on Travis. It is also available in pip, but boost.python is required extra. [82ab50](https://github.com/AbdealiJK/file-metadata/commit/82ab507d3dd7739c5ae00c95c48d975cecee1d84)
- **Barcode and QRCode** - libzbar has not been updated since 2009 and libdmtx has not been updated since 2011. Found zxing which supports more types of barcodes than both zbar and dmtx put together, so I'm using this. It's also much easier to install as it just needs java, and I can auto download a .jar file to run it. ZXing is not active, but it's still being supported - couldn't find anything better out there. [465ea4](https://github.com/AbdealiJK/file-metadata/commit/465ea448c36fa828c0153baa4324b751eb6296c4)
- **Bulk testing**:
- Tried toolslab, but my SSH session hangs frequently there, which is troublesome. Then I checked the dumps which toolslab already has, but they don't include the images ...
- Finally fell back to the backup plan we discussed: using Travis. I got the system to work there, but it sometimes timed out non-deterministically with an SSLError. Note that Travis is really fast, as expected: it can process 2000+ files within 5 minutes.
- Results at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs
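As a rough illustration of the Color Info item above, the sketch below averages RGB pixels and picks the nearest named color. It deliberately swaps in a simpler technique than the one used in the project: instead of colormath and pycolorname it uses squared Euclidean distance in RGB space, and `PALETTE` is a tiny hypothetical table, so the names it returns may differ from what a perceptual color difference would give.

```python
def average_color(pixels):
    """Return the mean (R, G, B) of an iterable of RGB tuples as integers."""
    totals = [0, 0, 0]
    count = 0
    for red, green, blue in pixels:
        totals[0] += red
        totals[1] += green
        totals[2] += blue
        count += 1
    if count == 0:
        raise ValueError('no pixels given')
    return tuple(int(round(total / float(count))) for total in totals)


# Hypothetical palette, far smaller than a real color-name database.
PALETTE = {
    'black': (0, 0, 0),
    'white': (255, 255, 255),
    'red': (255, 0, 0),
    'green': (0, 255, 0),
    'blue': (0, 0, 255),
}


def closest_color_name(rgb, palette=PALETTE):
    """Return the palette name with the smallest squared RGB distance."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=distance)
```

In practice the pixel iterable would come from an image library; here any sequence of `(R, G, B)` tuples works.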
== Week 3 (June 6 - June 12)
- **Dependencies** - Made https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/Dependencies and also added as much "auto-installation" as possible throughout the package:
- **pypi packages** - Make various packages install automatically with pypi if that's supported
- **barcode** analysis needs java, so [setup.py fails](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L36) if java is not installed. Then, the first time it's run, I [download .jar files](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/image/image_file.py#L126) into a folder and use those.
- **exiftool** - I download the [exiftool perl script](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/generic_file.py#L121) and run that using perl. Hence, [setup.py fails](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L17) if perl is not installed.
- **ffmpeg** - I found static builds (standalone builds) for linux 32/64 bit. I [use these if the platform is linux](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/mixins.py#L36), otherwise I make [setup.py fail](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L45) if avprobe/ffprobe are not found while installing.
- **Removed opencv** - OpenCV is difficult to install and was the only python library not in pypi, so I decided to use scikit-image instead, which is easy to install and also supports Python3. scikit-image can open many more image types (animated GIF, TIFF, etc.) and seems to be more useful. It was easy to replace the current usage of opencv with skimage, and I can add opencv back (as an optional dependency) if I ever find it has something skimage does not.
- **File conversions** - There were cases where zxing doesn't support certain files, where svg needed to be converted to png, etc. These cases are now handled:
- **zxing** - If the image is a JPEG encoded with CMYK, write a tempfile with RGB data and use that instead. [c8d60a](https://github.com/AbdealiJK/file-metadata/commit/c8d60a074dfc2372e4478284729604902171646e)
- **XCF** - Use imagemagick to convert xcf files. Earlier `convert` was being used with subprocess, but `wand` seems to be better as it's the official python binding. [b1424b](https://github.com/AbdealiJK/file-metadata/commit/b1424b1c6160894cb94b251ef5ad5fd8aa94c593)
- **SVG** - Convert SVG to PNG and then to the normal raster analysis. Use `wand` and imagemagick again to avoid additional dependencies. [4e252c](https://github.com/AbdealiJK/file-metadata/commit/4e252c09751849ac24a32ebadce40416b5ad5a93)
- **Installation** - jayvdb and DrTrigon tried installing file-metadata and found various issues, documented at [AbdealiJK/file-metadata/issues/20#224035714](https://github.com/AbdealiJK/file-metadata/issues/20#issuecomment-224035714). Various bug reports have been filed upstream.
- **Bulk test - image formats** - Done, results can be found at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs
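The "fail early if java, perl or ffprobe/avprobe is missing" checks described in the Dependencies item above could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's setup.py: `REQUIRED`, `missing_executables` and `check_dependencies` are hypothetical names, and it relies on `shutil.which`, which only exists on Python 3.3+.

```python
import shutil

# Each tuple lists alternative executables; one of them is enough.
# (Hypothetical list, mirroring the tools named in this report.)
REQUIRED = [('java',), ('perl',), ('ffprobe', 'avprobe')]


def missing_executables(groups):
    """Return the groups for which none of the alternatives is on PATH."""
    return [group for group in groups
            if not any(shutil.which(name) for name in group)]


def check_dependencies(groups=REQUIRED):
    """Raise RuntimeError naming every missing tool, else do nothing."""
    missing = missing_executables(groups)
    if missing:
        names = ', '.join(' or '.join(group) for group in missing)
        raise RuntimeError('Missing required executables: ' + names)
```

Calling `check_dependencies()` at the top of an install script would abort before any Python packages are installed, which is the behaviour the bullets above describe.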
== Week 4 (June 13 - June 19)
- **Make a simple bot and readme** - [User:AbdealiJK/file-metadata](https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata) and [gist:a94fc0](https://gist.github.com/AbdealiJK/a94fc0d0445c2ad715d9b1b95ec2ba03)
- **Draft email to commons-l** - [User:AbdealiJK/file-metadata/Email](https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/Email)
- **Upstream**
- **dlib** - Created the fix for setup.py [dlib/136](https://github.com/davisking/dlib/pull/136)
- **matplotlib** - Created PR to suggest installation step [matplotlib/6575](https://github.com/matplotlib/matplotlib/pull/6575)
- **skimage** - Found unusual bug in skimage where the file reading is not giving the expected output - [scikit-image/2154](https://github.com/scikit-image/scikit-image/issues/2154)
- **Bug fixes** -
- **Unicode encodings** - There was an issue in the decoding from C libraries pointed out by Zhuyifei [c755e7](https://github.com/AbdealiJK/file-metadata/commit/c755e751fb2259aa5cafdd5d1f6fc097f5698aa7)
- **Dict issue** - The dict did not handle keys which had a value before and were later changed to None. Pointed out by Zhuyifei [e11dc9](https://github.com/AbdealiJK/file-metadata/commit/e11dc9c8f499bbe246cbc07f8d249a9b85ba7f5b)
- **Large decompressed files** - Pillow threw a warning for files that had too many pixels. We now warn and ignore this gracefully. Fixed in [19382d](https://github.com/AbdealiJK/file-metadata/commit/19382d46d73b9bde685934faf32dd1d9611f92e5)
- **zxing small images** - ZXing has an issue with very small files because the first 3 pixel locations were hardcoded. We now skip small images for zxing. Fixed in [1c0de6](https://github.com/AbdealiJK/file-metadata/commit/1c0de640a142a50b819bff12345f3b9ee548be63). Fixed upstream too [zxing/607](https://github.com/zxing/zxing/issues/607).
- **zxing unsupported image type** - ZXing does not support CMYK files. In some cases, exiftool is unsure if a file is RGB or CMYK. Here, just assume it's cmyk and convert it. Fixed in [7fc106](https://github.com/AbdealiJK/file-metadata/commit/7fc106548fc4ccc99b5c2d6e82eeaa3fad1e20f8)
- **New features**
- **zbar support** - Added zbar support [9a8e62](https://github.com/AbdealiJK/file-metadata/commit/9a8e628f69149be0037f89a8c0fe96b733032b01). zbar detects barcodes that zxing does not detect, especially vertical barcodes. It seems to have more false positives too. zbar also has some performance issues: it is considerably slower than zxing (3 times slower on my computer) and considerably more memory intensive (my system hangs when running the zbar tests).
- **Created with** - Analyze the exif data and provide a simpler analysis routine by checking all the traces left by different software. This routine has a curated list of software, each of which has a specific key associated with it. [3eb80c](https://github.com/AbdealiJK/file-metadata/commit/3eb80c62237572799dc6d9b179ab0d25771122db)
- **Bulk tests** - I got the bot running on the toolslab server (Zhuyifei wanted to test it there and I thought it'd be a good idea to get it running there). I also got it to run in bulk using toolslab (because Travis' time limit was getting annoying). Made the logs at [...logs/Category_Images_from_the_State_Library_of_Queensland](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Images_from_the_State_Library_of_Queensland) and [.../logs/Category_JPEG_files](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_JPEG_files)
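The "Created with" routine above is described as a curated list of software, each tied to a specific metadata key. A minimal sketch of that idea follows; the rule table, the key names and the `detect_software` function are illustrative assumptions and are not taken from the linked commit.

```python
# Hypothetical curated rules: (metadata key, lowercase substring, software name).
SOFTWARE_RULES = [
    ('EXIF:Software', 'gimp', 'GIMP'),
    ('EXIF:Software', 'photoshop', 'Adobe Photoshop'),
    ('XMP:CreatorTool', 'photoshop', 'Adobe Photoshop'),
    ('XMP:CreatorTool', 'inkscape', 'Inkscape'),
]


def detect_software(exif):
    """Return the names of software whose traces appear in the exif dict."""
    found = []
    for key, needle, name in SOFTWARE_RULES:
        value = exif.get(key)
        if value is None:
            continue
        # Compare case-insensitively and report each program only once.
        if needle in str(value).lower() and name not in found:
            found.append(name)
    return found
```

Keeping the rules in a flat table like this makes it cheap to add another program: one new tuple, no new code.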
== Week 5 - Midterm Evaluation (June 20 - June 26)
- **Upstream**:
- **installation using wheels** - I verified that wheels are used when installing all the dependencies except dlib. (This was tested on ToolsLab, on my local computer and on Travis.) There may be cases where wheels aren't used, but on common (manylinux?) systems they are used.
- **dlib** - Make wheels rather than compiling [dlib/138](https://github.com/davisking/dlib/issues/138)
- **New features**:
- **Haarcascades** - Made haarcascades work [bc55f9](https://github.com/AbdealiJK/file-metadata/commit/bc55f9063a5ad6a699b6fb519af4271974af782b).
- **Line detection** - Got the line detection features working [723a23](https://github.com/AbdealiJK/file-metadata/commit/723a2322eb08dbdd7a9b66b328c630a599f3190c)
- **Testing**:
- Bulk tests with zbar, zxing, haarcascades, dlib can be found for [JPEG_files](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_JPEG_files) and [SLQ](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Images_from_the_State_Library_of_Queensland).
- Test on files from Category:Faces - Ran on the subset [Male Faces](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Male_faces)
- Test on files from Category:Line_drawings - Ran on [Line drawings](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Line_drawings)
- **Bug fixes**:
- **Disk space clean up** - Added the close() function and also `__enter__` and `__exit__` to delete the files which are created temporarily during analysis (Example: XCF -> PNG, etc.) [6bbca2](https://github.com/AbdealiJK/file-metadata/commit/6bbca26da10f3186d7e2cdc36b8c21b2cc08eacf)
- **Memory leak** - Fixed a memory leak that caused memory errors on ToolsLab (which has 16GB RAM) whenever the bot ran on more than ~1500 files. [bcb071](https://github.com/AbdealiJK/file-metadata/commit/bcb071907a40fe811f43f7798d1064817b08a39f)
- **TIFF files in zxing** - ZXing cannot handle TIFF files and they were being ignored. Now, we convert it to PNG and use the PNG with zxing to detect barcodes. [4ec26a](https://github.com/AbdealiJK/file-metadata/commit/4ec26a49fa038b7dde525ba7721c99703c48e6e5)
- **Misc**
- **Wikitech** - Made page on wikitech about how to install the bot there [wikitech:User:AbdealiJK/file-metadata](https://wikitech.wikimedia.org/wiki/User:AbdealiJK/file-metadata).
- **Mention Categories** - Made the bot script mention which categories the file should belong to []()
- **Dependencies** - I've removed some dependencies which were very trivial in favour of just making the function myself in file-metadata. The dependencies did not have apt-get packages, so in future if we want to make a .deb package of file-metadata, it becomes easier.
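The disk space clean-up item above (close() plus `__enter__`/`__exit__`) follows the usual context-manager pattern. The sketch below shows that pattern in isolation; `TempFileOwner` and `make_temp` are hypothetical names, not the classes from the linked commit.

```python
import os
import tempfile


class TempFileOwner(object):
    """Tracks temporary files and deletes them on close()."""

    def __init__(self):
        self._temp_paths = []

    def make_temp(self, suffix=''):
        """Create an empty temporary file and remember its path."""
        handle, path = tempfile.mkstemp(suffix=suffix)
        os.close(handle)
        self._temp_paths.append(path)
        return path

    def close(self):
        """Delete every tracked temporary file; safe to call repeatedly."""
        while self._temp_paths:
            path = self._temp_paths.pop()
            if os.path.exists(path):
                os.remove(path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block
```

Used as `with TempFileOwner() as owner: ...`, converted files such as an intermediate PNG are removed even if the analysis raises, which matters when running over thousands of files.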
== Week 6 (June 27 - July 3)
== Week 7 (July 4 - July 10)
== Week 8 (July 11 - July 17)
== Week 9 (July 18 - July 24)
== Week 10 (July 25 - July 31)
== Week 11 (Aug 1 - Aug 7)
== Week 12 (Aug 8 - Aug 14)
== Week 13 (Aug 15 - Aug 23)