Weekly reports of progress made for the GSoC project T129611.
== Community Bonding (April 22 - May 22)
- Requested a task to create the project: T133761 (#Pywikibot-catimages)
- Created the community bonding evaluation subtask: T133692
- Created Weekly Reports subtask and community report task
- Created the GitHub repo for `file-metadata` - http://github.com/AbdealiJK/file-metadata
- Set up CI for the GitHub repo
- Made a dev PyPI package for the repo - https://pypi.python.org/pypi/file-metadata
- I have read through [Life_of_a_successful_project](https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_project) completely.
- Created a [ToolsLab account](https://wikitech.wikimedia.org/wiki/User_talk:AbdealiJK#Welcome_to_Tool_Labs)
- **Get involved in commons community**:
- Created user pages on Commons [User:AbdealiJK](https://commons.wikimedia.org/wiki/User:AbdealiJK) and meta.wikimedia [User:AbdealiJK](https://meta.wikimedia.org/wiki/User:AbdealiJK)
- Made 500+ edits on Commons [Special:Contributions/AbdealiJK](https://commons.wikimedia.org/wiki/Special:Contributions/AbdealiJK) (since reached 1000+)
- Sent email to commons, pywikibot, wikimedia lists - https://lists.wikimedia.org/pipermail/pywikibot/2016-May/009452.html
- **Get involved in pywikibot community**:
- Went through and gave input on bug reports related to [isort](https://phabricator.wikimedia.org/T132122), [testing](https://phabricator.wikimedia.org/T115313), [htm_comparator](https://phabricator.wikimedia.org/T134341), [ToolsLab](https://phabricator.wikimedia.org/T134232), [weblinkchecker](https://phabricator.wikimedia.org/T124287), and many more which can be seen in my profile [Phab: AbdealiJK](https://phabricator.wikimedia.org/p/AbdealiJK/). Also filed bugs based on my experience, like [-random](https://phabricator.wikimedia.org/T134720), [toolslab speed](https://phabricator.wikimedia.org/T134232), [proofreadpage test issue](https://phabricator.wikimedia.org/T129965), etc.
- [TODO] Make bot to verify wikimedia logos [T134644](https://phabricator.wikimedia.org/T134644)
- **Get involved with the communities of 3rd party dependencies of catimages**
- Hunt why [music21's travis](https://travis-ci.org/cuthbertLab/music21) fails - The master branch builds successfully; only dev branches fail, which is fine
- Hunt why [bob's travis](https://travis-ci.org/idiap/bob) fails - Created [idiap/bob#221](https://github.com/idiap/bob/issues/221); it is a non-deterministic issue with travis and their deps, not a code problem, and pretty tough to solve myself.
- Check if python 3.3 is supported on cyvlfeat and add to their travis if it is - Created [menpo/cyvlfeat/12](https://github.com/menpo/cyvlfeat/issues/12); cyvlfeat does support 3.3 but their CI utils don't.
- Add python3 support to [yaafe and its travis](https://travis-ci.org/mckelvin/Yaafe) - Created [Yaafe/Yaafe/21](https://github.com/Yaafe/Yaafe/pull/21)
- Add 3.3 support for [matplotlib's travis](https://travis-ci.org/matplotlib/matplotlib) - This was removed because matplotlib decided to drop support for py3.3 and py2.6 in [254e16925](https://github.com/matplotlib/matplotlib/commit/254e16925644e114cb06ceaf9085196a6de0545d); no plans to add it back
- **Understand catimages better**
- Create list of things done by catimages - https://etherpad.wikimedia.org/p/Zl7V7KuK7J
- Get the catimages script working, at least some branches like JPEG, PNG, etc.
- Identify binary files, python packages and other dependencies of catimages. Also, identify whether they have a CI system and make pull requests for Travis if they do not have it. [List is here](https://phabricator.wikimedia.org/T129611)
- Compare ImageAnnotator gadget and rillke's JS template FileContentsByBot [page](https://commons.wikimedia.org/wiki/User:AbdealiJK/Comparison_AnnotationTool_FileContentsByBot)
- Decide on complete project plan with mentors (Decided in meeting 3)
- Create Subtasks for the Project based on above project plan (made in the file-metadata github repo)
- Published report (doing this every 2-3 days)
==== Meetings
(Minutes and Agenda are mentioned in the task related to the Meeting)
- Sun 1 May 2016 - 13:00 UTC : T133763, E172
- Sat 7 May 2016 - 12:30 UTC: T134121, E173
- Fri 13 May 2016 - 12:30 UTC: T134656, google calendar (E178)
- Fri 20 May 2016 - 11:30 UTC: T135230, google calendar
- Mon 23 May 2016 - {T135836}
- Week 1: Fri 27 May 2016 - 12:30 UTC: T135834, google calendar
- Week 2: Fri 3 June 2016 - 12:30 UTC: T136409, google calendar
- Week 3: Fri 10 June 2016 - 12:30 UTC: T136934, google calendar
- {T135835}
- Week 4: Fri 17 June 2016 - 12:30 UTC: T137557, google calendar
- took the final decision about {T135835} (DrTrigon had to think about that and Docker)
- Summary: we go for Docker (with help of e.g. Vagrant and VM like VirtualBox) and conda (mainly for win to fulfill deps)
- Week 5: Fri 24 June 2016 - 12:30 UTC: T138121, google calendar
- Week 6: Fri 1 July 2016 - 12:30 UTC: T138582, google calendar
- Week 7: Fri 8 July 2016 - 12:30 UTC: T139188, google calendar
- Week 8: Fri 15 July 2016 - 12:30 UTC: T139854, google calendar
- Week 9: Fri 22 July 2016 - 12:30 UTC: T141189, google calendar
- Week 10: Fri 29 July 2016 - 12:30 UTC: T141194, google calendar
== Week 1 (May 23 - May 29)
- **Dependencies on Travis**: I tested a lot of dependencies on travis and got them installed there. I found that dlib would be easier to install with conda because the `pkgconfig.h` wasn't found. Plus, conda has binaries for scipy making it much faster than installing with pip. (Even cyvlfeat is easier with conda...)
- **Framework**: I made an OOP framework modeled on how catimages used to work: a class for each file type, and a generic GenericFile class which everything inherits from. It is split into multiple files to keep it readable and modular. [45eae3](https://github.com/AbdealiJK/file-metadata/commit/45eae382510b66fecab777e744e98b2c8dd528e1)
- **Exif data**: Read exif data using exiftool. Using json to parse data. [b51228](https://github.com/AbdealiJK/file-metadata/commit/b5122872db3acb4c406945fa3daf1595b4c7cd84)
- **Mimetype**: Use magic for mimetype analysis with backups [1102cf](https://github.com/AbdealiJK/file-metadata/commit/1102cfb8a749661ce8a2be9ebd551b64995114a2)
- {T135836}
- **music21 and ffprobe**: Implemented ffprobe and avprobe support. Now, audio files are completely handled - Duration, rate, streams info, etc. are available.
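The class layout described above can be sketched roughly as follows - this is a simplified illustration, and all names besides `GenericFile` are mine, not the actual file-metadata API:

```python
import os


class GenericFile(object):
    """Base class: holds the file path and caches analysis results."""

    def __init__(self, fname):
        self.fname = fname
        self._analysis = {}

    @classmethod
    def create(cls, fname):
        """Factory: pick a subclass by file extension (a simplification;
        the real code also inspects the mimetype)."""
        ext = os.path.splitext(fname)[1].lower()
        for klass in cls.__subclasses__():
            if ext in klass.extensions:
                return klass(fname)
        return cls(fname)

    def analyze(self):
        """Run every analyze_* method and merge the results into one dict."""
        for name in dir(self):
            if name.startswith('analyze_'):
                self._analysis.update(getattr(self, name)())
        return self._analysis


class ImageFile(GenericFile):
    extensions = ('.jpg', '.png', '.gif')

    def analyze_dimensions(self):
        return {'Image:Width': 0, 'Image:Height': 0}  # stub values


class AudioFile(GenericFile):
    extensions = ('.ogg', '.wav')

    def analyze_duration(self):
        return {'Audio:Duration': 0}  # stub value
```

The factory plus `analyze_*` convention is what keeps the per-filetype modules independent of each other.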
== Week 2 (May 30 - June 5)
- **Color Info** - Used colormath and pycolorname to get the average color of the given image. [e3f1bb](https://github.com/AbdealiJK/file-metadata/commit/e3f1bbef799d588952717faf4b970d45608ac9dc)
- **Facial landmark** - Got dlib working, but used conda to install it on Travis. It is also available on pip, but boost.python is additionally required. [82ab50](https://github.com/AbdealiJK/file-metadata/commit/82ab507d3dd7739c5ae00c95c48d975cecee1d84)
- **Barcode and QRCode** - libzbar has not been updated since 2009 and libdmtx has not been updated since 2011. Found zxing which supports more types of barcodes than both zbar and dmtx put together, so I'm using this. It's also much easier to install as it just needs java, and I can auto download a .jar file to run it. ZXing is not active, but it's still being supported - couldn't find anything better out there. [465ea4](https://github.com/AbdealiJK/file-metadata/commit/465ea448c36fa828c0153baa4324b751eb6296c4)
- **Bulk testing**:
- Tried toolslab, but it frequently hangs my SSH, which is troublesome. Then I checked the dumps which toolslab already has, but they don't have the images ...
- Finally went with the backup we discussed of using Travis. I got the system to work there, but it sometimes timed out with an SSLError non-deterministically. Note that travis is really fast, as expected - it can do 2000+ files within 5 minutes.
- Results at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs
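The average-colour step above reduces to a per-channel mean over all pixels; a minimal numpy sketch of just that step (the real code additionally maps the RGB triple to a colour name via pycolorname):

```python
import numpy as np


def mean_color(image):
    """Average colour of an RGB image given as a (height, width, 3)
    uint8 array: flatten to one row per pixel, then mean per channel."""
    return tuple(image.reshape(-1, 3).mean(axis=0))
```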
== Week 3 (June 6 - June 12)
- **Dependencies** - Made https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/Dependencies and also made as much "auto-installation" as possible throughout the package:
- **pypi packages** - Make various packages install automatically with pypi if that's supported
- **barcode** analysis needs java. So, [setup.py fails](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L36) if java is not installed. Then, the first time it's run, I [download .jar files](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/image/image_file.py#L126) into a folder and use that.
- **exiftool** - I download the [exiftool perl script](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/generic_file.py#L121) and run that using perl. Hence, [setup.py fails](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L17) if perl is not installed.
- **ffmpeg** - I found static builds (standalone builds) for linux 32/64 bit. I [use these if the platform is linux](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/file_metadata/mixins.py#L36), otherwise I make [setup.py fail](https://github.com/AbdealiJK/file-metadata/blob/14783ce7ad51595eebeda45a7bc918cf04229a7f/setup.py#L45) if avprobe/ffprobe are not found while installing.
- **Removed opencv** - OpenCV is difficult to install and was the only python library not on pypi, so I decided to use scikit-image instead, which is easy to install and also supports Python 3. scikit-image can open many more image types (animated GIF, TIFF, etc.) and seems more useful. It was easy to replace the current usage of opencv with skimage, and I can add opencv back (as an optional dependency) if I ever find it has something skimage does not.
- **File conversions** - There were cases where zxing doesn't support certain files, SVG needed to be converted to PNG, etc. Handled those issues:
- **zxing** - If the image in JPEG encoded with CMYK, write a tempfile with RGB data and use that instead. [c8d60a](https://github.com/AbdealiJK/file-metadata/commit/c8d60a074dfc2372e4478284729604902171646e)
- **XCF** - Use imagemagick to convert xcf files. Earlier `convert` was being used with subprocess, but `wand` seems to be better as it's the official python binding. [b1424b](https://github.com/AbdealiJK/file-metadata/commit/b1424b1c6160894cb94b251ef5ad5fd8aa94c593)
- **SVG** - Convert SVG to PNG and then to the normal raster analysis. Use `wand` and imagemagick again to avoid additional dependencies. [4e252c](https://github.com/AbdealiJK/file-metadata/commit/4e252c09751849ac24a32ebadce40416b5ad5a93)
- **Installation** - jayvdb and DrTrigon tried installing file-metadata and found various issues documented at [AbdealiJK/file-metadata/issues/20#224035714](https://github.com/AbdealiJK/file-metadata/issues/20#issuecomment-224035714). Various bug reports upstream have been made.
- **Bulk test - image formats** - Done, results can be found at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs
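The CMYK workaround for zxing amounts to re-saving the image as RGB into a temporary file and handing that to zxing instead; a hedged Pillow sketch of the idea (the function name is mine, not file-metadata's):

```python
import tempfile

from PIL import Image


def to_rgb_tempfile(path):
    """If the image is not RGB (e.g. a CMYK-encoded JPEG), write an RGB
    copy to a temporary PNG and return that path; otherwise return the
    original path unchanged."""
    img = Image.open(path)
    if img.mode == 'RGB':
        return path
    tmp = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    tmp.close()
    img.convert('RGB').save(tmp.name)
    return tmp.name
```

The caller is then responsible for deleting the temporary file once the barcode scan is done.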
== Week 4 (June 13 - June 19)
- **Make a simple bot and readme** - [User:AbdealiJK/file-metadata](https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata) and [gist:a94fc0](https://gist.github.com/AbdealiJK/a94fc0d0445c2ad715d9b1b95ec2ba03)
- **Draft email to commons-l** - [User:AbdealiJK/file-metadata/Email](https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/Email)
- **Upstream**
- **dlib** - Created the fix for setup.py [dlib/136](https://github.com/davisking/dlib/pull/136)
- **matplotlib** - Created PR to suggest installation step [matplotlib/6575](https://github.com/matplotlib/matplotlib/pull/6575)
- **skimage** - Found unusual bug in skimage where the file reading is not giving the expected output - [scikit-image/2154](https://github.com/scikit-image/scikit-image/issues/2154)
- **Bug fixes** -
- **Unicode encodings** - There was an issue in the decoding from C libraries pointed out by Zhuyifei [c755e7](https://github.com/AbdealiJK/file-metadata/commit/c755e751fb2259aa5cafdd5d1f6fc097f5698aa7)
- **Dict issue** - The dict did not handle keys which had a value before and get changed to None. Pointed out by Zhuyifei [e11dc9](https://github.com/AbdealiJK/file-metadata/commit/e11dc9c8f499bbe246cbc07f8d249a9b85ba7f5b)
- **Large decompressed files** - Pillow threw a warning for files that had too many pixels. We now warn and ignore this gracefully. Fixed in [19382d](https://github.com/AbdealiJK/file-metadata/commit/19382d46d73b9bde685934faf32dd1d9611f92e5)
- **zxing small images** - ZXing has an issue with very very small files because the first 3 pixel locations were hardcoded. Ignore small images for zxing. Fixed in [1c0de6](https://github.com/AbdealiJK/file-metadata/commit/1c0de640a142a50b819bff12345f3b9ee548be63). Fixed upstream too [zxing/607](https://github.com/zxing/zxing/issues/607).
- **zxing unsupported image type** - ZXing does not support CMYK files. In some cases, exiftool is unsure if a file is RGB or CMYK. Here, just assume it's cmyk and convert it. Fixed in [7fc106](https://github.com/AbdealiJK/file-metadata/commit/7fc106548fc4ccc99b5c2d6e82eeaa3fad1e20f8)
- **New features**
- **zbar support** - Added zbar support [9a8e62](https://github.com/AbdealiJK/file-metadata/commit/9a8e628f69149be0037f89a8c0fe96b733032b01). zbar detects barcodes that zxing does not detect, especially vertical barcodes, though it seems to have more false positives too. zbar also has some performance issues: it is considerably slower than zxing (3 times slower on my computer) and considerably more memory intensive (my system hangs when running zbar tests).
- **Created with** - Analyze the exif data and provide a simpler analysis routine by checking for the traces of different software. This routine has a curated list of software, each with a specific key associated with it. [3eb80c](https://github.com/AbdealiJK/file-metadata/commit/3eb80c62237572799dc6d9b179ab0d25771122db)
- **Bulk tests** - I got the bot running on the ToolsLab server (Zhuyifei wanted to test it there and I thought it'd be a good idea to get it running there). I also got it to run in bulk on ToolsLab (because Travis' time limit was getting annoying). Made the logs at [...logs/Category_Images_from_the_State_Library_of_Queensland](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Images_from_the_State_Library_of_Queensland) and [.../logs/Category_JPEG_files](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_JPEG_files)
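The "Created with" routine is essentially a curated lookup table over exif-style keys; a sketch under that assumption (the rule list here is illustrative - the actual keys and software list differ):

```python
def created_with(exif):
    """Guess the creating software from exif-style metadata by checking
    a curated list of (key, substring, software-name) rules."""
    rules = [
        ('PNG:Software', 'gnome-screenshot', 'gnome-screenshot'),
        ('EXIF:Software', 'GIMP', 'GIMP'),
        ('EXIF:Software', 'Adobe Photoshop', 'Adobe Photoshop'),
        ('SVG:Desc', 'Created with Inkscape', 'Inkscape'),
    ]
    found = set()
    for key, substring, software in rules:
        # Case-insensitive substring match against the metadata value.
        if substring.lower() in str(exif.get(key, '')).lower():
            found.add(software)
    return sorted(found)
```

For example, `created_with({'EXIF:Software': 'GIMP 2.8'})` matches the GIMP rule.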
== Week 5 - Midterm Evaluation (June 20 - June 26)
- **Upstream**:
- **installation using wheels** - I verified that wheels are used when installing all the dependencies except dlib. (This was tested on ToolsLab, on my local computer, and on Travis.) There may be cases where wheels aren't used, but on common (manylinux?) systems they are.
- **dlib** - Make wheels rather than compiling [dlib/138](https://github.com/davisking/dlib/issues/138)
- **New features**:
- **Haarcascades** - Made haarcascades work [bc55f9](https://github.com/AbdealiJK/file-metadata/commit/bc55f9063a5ad6a699b6fb519af4271974af782b).
- **Line detection** - Made features for line detection to work [723a23](https://github.com/AbdealiJK/file-metadata/commit/723a2322eb08dbdd7a9b66b328c630a599f3190c)
- **Testing**:
- Bulk tests with zbar, zxing, haarcascades, dlib can be found for [JPEG_files](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_JPEG_files) and [SLQ](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Images_from_the_State_Library_of_Queensland).
- Test on files from Category:Faces - Ran on the subset [Male Faces](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Male_faces)
- Test on files from Category:Line_drawings - Ran on [Line drawings](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Line_drawings)
- **Bug fixes**:
- **Disk space clean up** - Added the close() function and also the `__enter__` and `__exit__` to delete the files which are created temporarily while doing analysis (Example: XCF -> PNG, etc.) [6bbca2](https://github.com/AbdealiJK/file-metadata/commit/6bbca26da10f3186d7e2cdc36b8c21b2cc08eacf)
- **Memory leak** - Fixed a memory leak because of which, when running the bot on more than ~1500 files, it kept giving memory errors on ToolsLab (which has 16 GB RAM). [bcb071](https://github.com/AbdealiJK/file-metadata/commit/bcb071907a40fe811f43f7798d1064817b08a39f)
- **TIFF files in zxing** - ZXing cannot handle TIFF files and they were being ignored. Now, we convert it to PNG and use the PNG with zxing to detect barcodes. [4ec26a](https://github.com/AbdealiJK/file-metadata/commit/4ec26a49fa038b7dde525ba7721c99703c48e6e5)
- **Misc**
- **Wikitech** - Made page on wikitech about how to install the bot there [wikitech:User:AbdealiJK/file-metadata](https://wikitech.wikimedia.org/wiki/User:AbdealiJK/file-metadata).
- **Mention Categories** - Made the bot script mention which categories the file should belong to
- **Dependencies** - I've removed some very trivial dependencies in favour of implementing the functions myself in file-metadata. Those dependencies did not have apt-get packages, so if we want to make a .deb package of file-metadata in the future, this makes it easier.
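The disk-space cleanup above is essentially a close()/context-manager pattern over the temporary files created during analysis (e.g. the PNG made from an XCF or SVG); a sketch of that pattern (class and attribute names are illustrative):

```python
import os
import tempfile


class AnalysisSession(object):
    """Track temporary files created during analysis and delete them on
    close() or when leaving a `with` block."""

    def __init__(self):
        self.temp_files = []

    def make_temp(self, suffix=''):
        # Create a named temp file, remember its path for later cleanup.
        tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        tmp.close()
        self.temp_files.append(tmp.name)
        return tmp.name

    def close(self):
        for path in self.temp_files:
            if os.path.exists(path):
                os.remove(path)
        self.temp_files = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

Using `with AnalysisSession() as s:` guarantees the temp files are removed even if the analysis raises.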
== Week 6 (June 27 - July 3)
- **Bulk tests**
- **line drawing** - Did line drawing test on Male_Faces category [User:AbdealiJKTravis/logs/Category_Male_faces](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Male_faces)
- **Upstream**
- **dlib** - I began working on this and found it to be a major time sink. I needed to understand wheels and manylinux to be able to make it. I got Matthew Brett, who has made many wheels for numpy, scipy, matplotlib, etc., interested in this, so hopefully he can get the wheel part done and I can help out where necessary. [dlib/138](https://github.com/davisking/dlib/issues/138#issuecomment-229653269)
- **pillow** - Ensured that the Pillow version 3.3.0 due on July 1 will have wheels :) [Pillow/1859](https://github.com/python-pillow/Pillow/issues/1859) and [Pillow/1990](https://github.com/python-pillow/Pillow/pull/1990) and [pillow-wheels/36](https://github.com/python-pillow/pillow-wheels/pull/36)
- **Fixes/Features**
- **Dependencies** - Better way to handle dependencies -
- Created a class for each dependency which checks whether it is installed and so on. [c85dd4](https://github.com/AbdealiJK/file-metadata/commit/c85dd41e8616052f24db7837c3be8d67fc079ef0)
- This class provides a better error message if something is not found. Fixes [file-metadata/46](https://github.com/AbdealiJK/file-metadata/issues/46), [c85dd4](https://github.com/AbdealiJK/file-metadata/commit/c85dd41e8616052f24db7837c3be8d67fc079ef0)
- Removed the "auto download" of exiftool and ffmpeg, as downloading binaries should preferably be done by package managers. The new error messages ensure they are installed correctly during pip install. [035163](https://github.com/AbdealiJK/file-metadata/commit/035163378ec8c11a9a5f4c784a89706dcfa932dc)
- Using this, we now have a class for zxing too, which downloads the binary during `pip install`. (ZXing does not have an apt-get, yum, etc. package. It's only on maven, and we don't want users to install maven ...) [a19904](https://github.com/AbdealiJK/file-metadata/commit/a19904b3a1c0697491801c31c626eae4988f0dce)
- **Vehicle detection** - Tried [vehicle_detection_haarcascades](https://github.com/andrewssobral/vehicle_detection_haarcascades) which provides a cascade for cars. The code is in a separate branch [6e7c19](https://github.com/AbdealiJK/file-metadata/commit/6e7c19f7d31bfe8e0d8081288c6a344f1b6b9ef4) for testing. The cascade is not very good - it can only detect small cars, and if I give it something like [File:MERCEDESBENZ600Coupe-C100--2118_3.jpg](https://commons.wikimedia.org/wiki/File:MERCEDESBENZ600Coupe-C100--2118_3.jpg) it detects only false positives around the car ... In [File:Södertäljevägen_vinter_2013a_01.jpg](https://commons.wikimedia.org/wiki/File:Södertäljevägen_vinter_2013a_01.jpg) it was able to detect 2 cars correctly. So it seems this cascade is meant for videos, where there is blurring and the car is moving. The training set doesn't seem very good for static, good-quality images where the car is the main subject.
- **Face detection using Exif data**
- Found images which have exif data from the camera for most of the makes.
- Implemented code [b31a7c](https://github.com/AbdealiJK/file-metadata/commit/b31a7cbb86e9cde27ee40ffcfe0abaee412c0c40) which handles fujifilm, sony, nikon, and panasonic cameras. I found that the face was incorrectly detected in all except the Sony image.
- Verified that my code was giving the same results as the original facedetect.pl perl script. It seems the original script was not very correct in its decoding of face data, or the protocols have changed since ... I tried correcting it but I couldn't figure out how to encode the data.
- Conclusion: I think this is not worth it, and pretty difficult to maintain. Cameras will probably converge on a standardized way to save this (EXIF may add it to their specifications?) as it gets more popular. We could think about doing it then ...
- **RGBA handling** - Earlier, we used to simply strip the alpha channel out. This is incorrect, as some logo PNG files use alpha to darken or lighten a color. I came across [this SO question](http://stackoverflow.com/questions/2049230/convert-rgba-color-to-rgb) and now I assume RGBA images are on a white background and convert them to RGB using the appropriate formula. This makes RGBA handling much more sane and intuitive. [c20eca](https://github.com/AbdealiJK/file-metadata/commit/c20eca6257267d48abcbdff3171f4bf345b0849c) Also created an upstream issue about this [scikit-image/2174](https://github.com/scikit-image/scikit-image/issues/2174).
- **Script**
- Made a CLI script which is the "official bot demo script" as it resides directly in the github repo - no need to download a gist and so on. The console script is called `wikibot-filemeta-simple` [b6921c](https://github.com/AbdealiJK/file-metadata/commit/b6921cd410f5d8fc96d4ad5a91dd6421730dd27c)
- **Mention Categories** - Made an argument `-showcats` which shows the categories based on the analysis rather than the detailed results of the analysis. [9bafa1](https://github.com/AbdealiJK/file-metadata/commit/9bafa1daa68f26630d2906876b61b7f5691dfdfc)
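The RGBA fix from this week composites each pixel onto a white background with the standard formula out = rgb·a + 255·(1 − a), with alpha scaled to [0, 1]; a numpy sketch of that formula:

```python
import numpy as np


def rgba2rgb_white(img):
    """Composite a (H, W, 4) uint8 RGBA image onto a white background:
    out = rgb * alpha + 255 * (1 - alpha)."""
    rgb = img[..., :3].astype(float)
    alpha = img[..., 3:4].astype(float) / 255.0
    return (rgb * alpha + 255.0 * (1 - alpha)).round().astype(np.uint8)
```

A fully transparent pixel becomes pure white, an opaque pixel keeps its colour, and semi-transparent pixels blend toward white.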
== Week 7 (July 4 - July 10)
- **Bulk tests**
- **Upstream**
- **skimage** - Add feature `rgba2rgb()` in skimage - [scikit-image/2181](https://github.com/scikit-image/scikit-image/pull/2181)
- **Fixes/Features**
- **Software** - Detect if the software used was GIMP (for PNG/JPEG), gnome-screenshot, Adobe Photoshop photomerge (panorama), or Paint.NET [b950aa](https://github.com/AbdealiJK/file-metadata/commit/b950aa97a9f180b355fa13b1fd2db03fea16526d)
- **Camera/Scanner** - Detect the camera model (from EXIF) and suggest appropriate "Taken with <Camera model>" or "Scanned with <Scanner model>" categories. Sadly, there is no nice way to distinguish between a camera and a scanner from exif data. So, I had to just check whether the category exists using pywikibot. Because of this, if neither category exists, we suggest "Taken or Created with ..." and let a human figure it out. I tried using EXIF:FileSource, which has values like "Digital Camera", "Reflective Printing Scanner" and so on, to tell a camera from a scanner - but it was rarely present in the images I found and I wasn't able to check if it was reliable. [207de4/tests/bulk.py#L140](https://github.com/AbdealiJK/file-metadata/blob/207de44e23ec9561ecdfde5f1ffea14c7f31bc8e/tests/bulk.py#L140)
- **Docker** - I spent quite a lot of time on docker with no good result. It seemed easy, but I hit way too many issues while repeatedly testing things:
- Had issues like GPG key became invalid ([docker-brew-ubuntu-core/52](https://github.com/tianon/docker-brew-ubuntu-core/issues/52))
- Minor modifications became troublesome because it had to redownload things (My internet is currently mediocre)
- It took a lot of time to install everything / build everything (although my computer is decently fast)
- I then tried focusing on CentOS as Ubuntu became difficult to cope with, but it has its own issues (boost not being detected, zbar not found in EPEL7 but found in EPEL6, etc.)
- I then tried using Travis because my system seemed to be the bottleneck. The travis builds are slow because docker needs dist:trusty, which doesn't have container support yet. Had to wait nearly an hour sometimes for the build to start!
- **Geolocation** - I was able to get the latitude and longitude from EXIF, and then I had to use this to fetch the city/state/country at that lat/lon.
- I tried [geopy](https://pypi.python.org/pypi/geopy) which is a wrapper for various Map APIs to find the address based on lat/lon. I initially tried OpenStreetMaps and so on, but found GoogleV3 API was the nicest and provided structured address info (i.e. it told what the locality and country was separately). When doing this, the issue is that Google only allows 1500 free API calls.
- I then tried to use the demo [countries](https://github.com/che0/countries) which does this using shape files of the country borders from [thematicmapping.org](http://thematicmapping.org/downloads/world_borders.php). This requires some heavy dependencies like osgeo+GDAL for the mathematical shape computations. GDAL is not easy to install inside a virtualenv ... so, I've left it for now.
- Currently looking into using `nominatim` from OSM directly as it does provide structured data (geopy was not fetching the structured data)
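The Nominatim approach above uses the public `/reverse` endpoint with `format=json`, which returns a structured `address` dict; a sketch of building the request URL and picking the most specific region (the field-priority order here is my assumption - the returned fields vary by place):

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def reverse_geocode_url(lat, lon):
    """Build a reverse-geocoding URL for the public Nominatim endpoint.
    Note the usage policy: bulk querying is not allowed."""
    params = urlencode({'format': 'json', 'lat': lat, 'lon': lon})
    return 'http://nominatim.openstreetmap.org/reverse?' + params


def smallest_region(response):
    """Pick the most specific region from a Nominatim response's
    `address` dict, falling back toward country."""
    address = response.get('address', {})
    for key in ('city', 'town', 'village', 'state', 'country'):
        if key in address:
            return address[key]
    return None
```

The actual HTTP call (and the mandatory rate limiting) would sit between these two helpers.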
== Week 8 (July 11 - July 17)
- **Bulk tests**
- Ran [JPEG_files](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/JPEG_files) with the new bulk script which shows stats and so on too.
- Ran [-newfiles](https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newfiles) with the new bulk script which shows stats.
- Newfiles is quite unusual, because it seems the majority of the files are very large (frequently 300 MB tiff files). This makes the analysis and downloading very slow.
- There are race conditions in newfiles. I do a `if (page.exists()): download_page(page)` and there seems to be a race condition where the page may be deleted between the exist() check and downloading. This caused pywikibot's NoPage exceptions twice.
- Some files in the analysis will appear as `[[:]]` because they have been deleted after being analyzed, and a bot removes all of the file's references. For example File:Cyprien.jpg.
- **Upstream**
- **matplotlib** - The installation help message for missing dependencies got merged for matplotlib [matplotlib/6575](https://github.com/matplotlib/matplotlib/pull/6575).
- **Fixes/Features**
- **Scripts**:
- Made the bulk script an "official script" (i.e. it's packaged with pip) so users can run that themselves. It's mainly meant for statistics of file-meta like an integration test. (NOTE: This has not yet been pushed to master though)
- Added some stats to the bulk script like histogram of number of categories found per file
- Added the -showinfo:XYZ option with args "cats", "info", "all", so the user can decide whether to show all the detailed info, only the predicted categories, or both from the analysis.
- Added -limitsize:XYZ argument which decides what the maximum size of the file should be in MegaBytes. This helps with not wasting time and bandwidth downloading large files.
- **zxing** - Show an error when zxing is unable to open the file with java.io, as this indicates a problem with the file. This is done with logging.error()
- **GenericFile.create()** - Handle djvu files correctly; some were being wrongly detected as images and giving errors when skimage tried to open them.
- **Nominatim** - I tried using nominatim, and it worked fine. Currently I'm directly querying http://nominatim.openstreetmap.org without any python wrapper as it's simple enough. The issue is that nominatim.openstreetmap is not meant for bulk geocoding, as mentioned in its TnC, and they only allow a maximum of 1 query / minute to ensure that nobody abuses it.
- A pypi package named `nominatim` does exist, but it is only about 100 lines, which we can just do ourselves. Also, the package is 2 years old, has no travis builds, etc.
- I also asked on wikimaps-l about the nominatim API https://lists.wikimedia.org/pipermail/maps-l/2016-July/001497.html
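The per-file category histogram added to the bulk script this week is essentially a one-liner with `collections.Counter`; a sketch (the data shape here is illustrative):

```python
from collections import Counter


def category_histogram(results):
    """Histogram of how many categories were suggested per file.
    `results` maps file name -> list of suggested categories."""
    return Counter(len(cats) for cats in results.values())
```

The counter maps "number of suggested categories" to "number of files", which is exactly what a bar-chart template needs.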
== Week 9 (July 18 - July 24)
- **Bulk tests**
- The template "Bar charts" had a limit of 25 rows. There are people attempting to make it dynamic on Wikipedia, but it's not done yet. I've temporarily made [User:AbdealiJK/Templates/Bar_chart](https://commons.wikimedia.org/wiki/User:AbdealiJK/Templates/Bar_chart) which can handle 100 rows.
- **Upstream**
- **skimage** - `rgba2rgb()` got merged - [scikit-image/2181](https://github.com/scikit-image/scikit-image/pull/2181)
- **Fixes/Features**
- In bulkbot - Add the -dry argument to print the output to the terminal if needed.
- **Monochrome** - Added black and white detection based on mean square error thresholding over the color of every pixel. bulk_bot now adds this category to the images. Sepia is not as easy so I've left it for the time being.
- **Unidentified People** - Add "Category:Unidentified People" to all faces that are detected (or a leaf of it as applicable by location detection; the country level is always added even if it's a redlink)
- **Groups of People** - Add "Category:Groups of People" when there are more than 3 faces detected in a file. (or a leaf of it as applicable by Location detection. The country level is always added even if it's a redlink)
- **Transparency** - Add "Category:Transparent Background" if the image has an alpha channel and also the alpha channel is less than 255 at some point (i.e. it's not always opaque even though it has an alpha channel)
- **dlib faces** - Added a threshold on score for dlib detected faces (0.045)
- **Football kits** - Added subcategories of Category:Football kit templates based on file size as it is very standard.
- **Chemical compounds**
- If the image was created with chemtool, add it to Category:Chemical compounds as chemtool is only used to make compound structures.
- Earlier we discussed that if there are only C, H, N, and O in text of the SVG it probably is a chemical compound. But it seems the tools users are using to make these SVG aren't using the <text> tag in SVG, rather they use a <g> tag to make a path (elliptical path for O, etc). This makes it difficult to detect the compounds (I checked about 50 images randomly sampled).
- **Locations** -
- Add "Template:GPS EXIF" when the GPS coordinates is found.
- Also, if the location is found, use that and add Category:<City> or Category:<State> or Category:<Country> to the image (The smaller region is preferred). This is because sometimes the city name isn't valid and complicated cases can arise - for example name clashes like "Category:Punjab, India" vs "Category:Punjab, Pakistan". In which case it's probably better for a human to check it out.
- Also try to use <City>, <State> (Example: New Haven, Connecticut) or <City>, <Country> or <State>, <Country> (Example: Punjab, India) if they exist.
- **Glasses** - Tried to detect sunglasses / glasses on a face. But found that the haarcascade frequently detects normal eyes without glasses too. Also, the detection of glasses/eyes is fairly inaccurate and we need to rely on the normal eye detection code. I've run it on more than 300 files now and it has never given a correct positive ... I don't think it is really useful.
- **Color calibration bars** - This is in regard to the color bar jayvdb pointed out in [Category:Robert_N._Dennis_collection_of_stereoscopic_views](https://commons.wikimedia.org/wiki/Category:Robert_N._Dennis_collection_of_stereoscopic_views). It seems this strip is not specific to stereoscopic views. It is rather a scanning technique for [color calibration](https://en.wikipedia.org/wiki/Color_calibration#Calibration_techniques_and_procedures). I could not find a category which contains all images with this so I'm currently adding "Category:Scans with IT8 target" to these images. I've tried 3-4 different possible algorithms and found that they had a pretty high error rate:
- Find intensity profile over the top/bottom of the image and find number of jumps. If the jumps are ~20 it would be a strip. Also, have appropriate conditions to check if the
- Try Dynamic Time Warping to detect whether the function is similar to floor(x) - but this does not incorporate the intensity (Y) error which varies a lot from image to image
- Ensure that the bar is black and white to make the other methods robust - but there is actually a lot of color variation in the bar when doing this.
- Attempt to detect segments using SLIC, Quickshift, and Felzenszwalb's algorithm and then use these to find the average color. But these segmentation techniques over-segment the image, so no easy data can be extracted from them.
- I also attempted basic mean and median filtering to reduce error.
- **Stereo cards** - Stereo cards are another thing we could detect. They were popular and hence are frequent in museum collections. A stereo card has 2 images which look very similar next to each other.
- Tried doing a simple image detection routine but most of them are pixel by pixel comparisons
- Attempted to use something similar to Spatial Pyramid Pooling which pixelizes the image and then does pixel by pixel Mean Square Error calculation
- Tried Histogram based image recognition as similar images would have similar histograms, but here again the MSE of the histograms is not distinguishable from normal images
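The black-and-white detection from this week thresholds the mean square error between each pixel's channels and that pixel's own gray value; a numpy sketch under that assumption (the threshold value here is illustrative, not the one used in bulk_bot):

```python
import numpy as np


def is_monochrome(img, threshold=10.0):
    """Mean square error between each pixel's channels and the pixel's
    own gray value; a small MSE means the image is (near) black & white."""
    rgb = img.reshape(-1, 3).astype(float)
    gray = rgb.mean(axis=1, keepdims=True)  # per-pixel gray value
    mse = ((rgb - gray) ** 2).mean()
    return mse < threshold
```

A perfectly gray image has MSE 0; a saturated image has channels far from its gray value and a large MSE.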
== Week 10 (July 25 - July 31)
== Week 11 (Aug 1 - Aug 7)
== Week 12 (Aug 8 - Aug 14)
== Week 13 (Aug 15 - Aug 23)