Port catimages.py to pywikibot-core
Phabricator task containing the proposal : T129611: [GSoC 2016 Proposal] Port catimages.py to pywikibot-core
- Name : Abdeali J Kothari
- Email : firstname.lastname@example.org
- Github : https://github.com/AbdealiJK
- Wikimedia accounts : Special:CentralAuth/AbdealiJK
- IRC nick : AbdealiJK
- Time Zone : UTC+5:30 (IST - India)
- Typical working time : 9:00am to 6:00pm IST
- Location : Chennai, India
- School / Degree : B.Tech. Engineering Physics at Indian Institute of Technology, Madras. Expected to graduate in August 2016
The aim of the project is to bring the catimages.py script from pywikibot-compat back to life. This involves heavy refactoring of the script. While doing this refactoring, it would be useful to modularize the script and turn it into a generic package. That generic package can then be used in pywikibot-core to provide the same functionality as it provided earlier.
- Possible Mentors : DrTrigon (@DrTrigon), John Vandenberg (@jayvdb)
- Languages used : Mainly Python, with C/C++ for the opencv dependency. Possibly PHP in case we plan to make a mediawiki extension
- Related phabricator issue : The issue this proposal solves is T66838: Port catimages.py to core
To wikimedia: catimages.py automates the categorization of images. It's an invaluable tool to have, and one that can be made extremely accurate given recent innovations in Computer Vision (CV). Using catimages.py we can give uploads more meaningful metadata to work with.
To pywikibot: With this project we can bring automated categorization, without manual intervention, back to pywikibot. Pywikibot already has the scripts imagerecat.py and checkimages.py. imagerecat doesn't work right now because it relies on CommonSense from wikisense, which is dead (T60869#1365653 and T78462), and checkimages requires manual input. As catimages attempts to categorize without any prior information (using just the file itself), if done right it would be easier to use.
To the rest of the world: Wikimedia is all about providing more data to everyone. One awesome outcome of moving catimages.py to pywikibot-core is that all the dependencies either become external pypi packages or get moved upstream to other packages. This is really good because it makes the tool usable by people who would not be able to manually install all of the dependencies! The pycolorname package that I developed in the microtask is an example of this: it is a great resource for developers handling color, and even non-developers can refer to the charts generated by it.
The project can be broadly divided into two parts:
- Dependency checking and updating the code to fix deprecated deps, plus general refactoring:
  - Refactoring (using pip where possible)
  - Porting to pywikibot-core
- Optimizing the script to work better and faster:
  - Improving the CV code by optimizing (e.g. replacing the algorithms used with better ones)
  - Adding new features
To explain the first part, we need to take a look at all the libraries and tools that catimages.py can use (refer to the table at T66838). As can be seen in the table, the following needs to be done to the dependencies:
- Some packages are usable directly (eg: numpy, scipy) - minimal work required
- Some need to be patched upstream (eg: music21, bob) - not much work, but may need time (3rd party maintainer involved)
- Some need to be packaged and uploaded to pypi (eg: pycolorname, jseg)
- Some need to be replaced as they are deprecated (eg: PIL, pyexiv2)
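As a sketch of how the ported script could cope with this mixed dependency landscape, optional packages can be probed at startup so that a missing one disables a feature instead of crashing the whole script. The package list below is illustrative, loosely based on the dependency table mentioned above:

```python
import importlib

# Illustrative subset of catimages.py's optional dependencies.
OPTIONAL_DEPS = ["numpy", "scipy", "cv2", "music21", "pycolorname"]

def check_dependency(name):
    """Return True if the module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

def available_features(deps=OPTIONAL_DEPS):
    """Map each optional dependency to its availability, so the script
    can degrade gracefully instead of failing on a missing package."""
    return {name: check_dependency(name) for name in deps}
```

A detector that needs, say, cv2 would then consult this map and simply be skipped when the package is unavailable.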
The second part, optimizing the script catimages.py, does not have a concrete game plan right now; this will be decided over the course of the project. I have worked in similar domains at my university. Possible algorithms that could be used:
- R-CNN (2014, Girshick et al.) may be a good idea for faster machines.
- SPP-net (2015, Shaoqing Ren et al.) is a faster version of R-CNN, but it takes longer to train. This may be beneficial in certain scenarios though.
- LC-KSVD is another algorithm I've used in the past; its results are not so great, but it works very fast.
One simple idea would be to provide a good interface with VLFeat - a library of Computer Vision algorithms specializing in image understanding. We can integrate VLFeat into the package and use the common algorithms directly. Although these algorithms are old (most were published in 2012 or earlier), they are useful and the library itself is flexible. This has the benefit that we will not have to maintain algorithm-specific code. Also, VLFeat, which was created to maintain a set of these algorithms, will probably be better at it.
This may not only be about the algorithm, but also about using other data sets. Specifically, I would like to try using the ILSVRC (or ImageNet) dataset to improve classification accuracy. The benefit of using this dataset is that it is an order of magnitude larger than Pascal (a detailed comparison can be found here), implying we have more supervised training data.
Community Bonding (April 22 - May 22)
The major aspects of the project that should be completed here are:
- Hack catimages.py (and deps) to be usable without pywikibot on a personal github repository. This is to get the basic functionality usable and testable.
- Understanding the math behind catimages.py. The algorithm and logic used in the script can be refactored to use newer methods from opencv, sklearn, and scipy, which are now optimized and more stable.
- Find methods to optimize catimages.py using new techniques.
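To illustrate the decoupling goal of the first bullet, here is a hedged sketch of what a pywikibot-free analysis entry point might look like. The function name and result fields are hypothetical placeholders, and the real catimages.py extracts far richer data than this:

```python
import mimetypes
import os

def analyse_file(path):
    """Pure analysis step: inspect a local file and return a metadata
    dict. No pywikibot involved, so it can be tested standalone.
    (Sketch only -- placeholder fields, not the real catimages output.)"""
    mime, _ = mimetypes.guess_type(path)
    return {
        "path": path,
        "size": os.path.getsize(path) if os.path.exists(path) else None,
        "mime": mime,
    }

# A thin pywikibot wrapper would later consume this dict, e.g. to decide
# which category tags to add to the file description page.
```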
Week 1 (May 23 - May 29)
The aim in Week 1 is handling all the modules which have patches. The reason to do this first is that it involves 3rd-party developers, who may take time to reply if upstream changes are needed. This includes the modules: bob, jseg, music21, xbob-flandmark.
What needs to be done is to check why the patches were applied, identify whether they are still needed, and make the appropriate changes upstream or inside catimages.py. Interestingly, most of these patches are small changes which simply add paths to sys.path or specify args. Hence, this can be solved within a week.
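As an illustration of the kind of change involved, a typical sys.path patch can be replaced by resolving bundled resources relative to the installed package. The paths and names below are made up for the example:

```python
import os

# Before (typical compat-era patch): a hard-coded path pushed onto sys.path
#   sys.path.append('/home/user/externals/jseg')
#   import jseg
#
# After: resolve bundled files relative to the module itself, so the code
# works wherever the package is installed. "data" is an illustrative
# subdirectory name.

def resource_path(filename, base=None):
    """Return an absolute path to a bundled data file. `base` defaults
    to the directory containing this module."""
    if base is None:
        base = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base, "data", filename)
```

Once a dependency is a proper pypi package, the sys.path manipulation disappears entirely and a plain `import` suffices.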
Week 2 (May 30 - June 5)
The aim for Week 2 is to replace unsupported packages and make pypi packages for dependencies which are currently shipped as archives (like .zip). The unsupported packages are zbar and pyexiv2. The archive-based dependencies are yaafelib, slic, jseg, and jseg/jpeg-6b.
We need to move these into github repositories, deploy them to pypi if needed, and then update catimages.py to use the new versions.
Measurable outcome: a first core bot script that can be run and tested
Week 3 (June 6 - June 12)
In week 3, it would be a good idea to pause and review the code written so far, fixing any minor issues that have come up. This is also a buffer week in case any work from the earlier weeks is still pending. I'd also like to write a blog post about my experience so far.
Week 4 (June 13 - June 19)
In week 4, the aim is to begin revamping the opencv code. First off, cv is deprecated and cv2 needs to be used instead. Also, the latest cv2 python bindings are not backward compatible, so the calling code needs to be updated. This will give me a good understanding of the Computer Vision part of catimages.py.
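To make the cv-to-cv2 migration concrete, here is a hedged sketch of Haar-cascade face detection under the new bindings. The cascade filename is the stock one shipped with OpenCV, and the error handling is illustrative rather than what catimages.py will necessarily do:

```python
# Old `cv` bindings (deprecated) looked roughly like:
#   import cv
#   storage = cv.CreateMemStorage()
#   faces = cv.HaarDetectObjects(img, cascade, storage, 1.1, 3, 0)
#
# The `cv2` bindings work on plain numpy arrays instead:

def detect_faces(image_path,
                 cascade_path="haarcascade_frontalface_default.xml"):
    try:
        import cv2
    except ImportError:
        return None  # OpenCV not installed; caller should skip this detector
    img = cv2.imread(image_path)
    if img is None:
        return []  # unreadable or missing file
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(cascade_path)
    # Returns bounding boxes as (x, y, w, h) tuples.
    return list(cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5))
```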
Week 5 - Midterm Evaluation (June 20 - June 26)
The opencv package is a little problematic: it contains a lot of custom C++/C code which needs to be refactored. Hence, I'll use my knowledge of sklearn and OpenCV (from last week and community bonding) to clean up the opencv package.
For the midterm evaluation, then, I plan to have a pypi package which can perform all the functions that catimages.py could. This would be independent of pywikibot-core, probably just a simple package. Let's call this pypi package pypi-catimages to avoid confusion.
Measurable outcome: Have a pypi package "pypi-catimages" which can perform a minimal set of the functions that catimages.py could: Detect faces using haarcascades, Categorize based on metadata, Barcode and QR Code detection, Graphics detection. (milestone?)
Week 6 (June 27 - July 3)
By Week 6, I plan to begin integrating the previously created pypi package with pywikibot. This means using cli arguments from pywikibot, using the page generators in core, and updating all pywikibot-related functionality.
It is not yet clear where this script will live. Possible approaches are:
- pywikibot-core and pypi-catimages become requirements of a further pypi package, pywikibot-catimages, for use with mediawiki.
- A script is created in pywikibot-core which uses pypi-catimages and handles the args required for the script to be used with mediawiki.
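As an illustrative sketch of the second approach, the script-side option handling can be kept separate from pywikibot's generic argument handling. The flag names below (-catlimit, -dry) are invented placeholders; the remaining args would be handed on to pywikibot.handle_args() and the page generators:

```python
def split_args(argv):
    """Split argv into (script_options, remaining_pywikibot_args).
    Flags handled here are hypothetical examples; everything else is
    left for pywikibot.handle_args() to interpret."""
    options, remaining = {}, []
    for arg in argv:
        if arg.startswith("-catlimit:"):
            # Hypothetical cap on categories added per file.
            options["catlimit"] = int(arg.split(":", 1)[1])
        elif arg == "-dry":
            # Hypothetical flag: analyse but do not edit pages.
            options["dry"] = True
        else:
            remaining.append(arg)
    return options, remaining
```

This mirrors the convention existing pywikibot scripts follow, where each script consumes its own flags and delegates the rest to the framework.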
Measurable outcome: The outcome here would be a Pull Request / Patch Set which adds the pywikibot interface to pypi-catimages. [ok; I think we are quite close - did we miss or neglect anything up to here that still needs finishing?]
Week 7 (July 4 - July 10)
In Week 7 there is, again, a pause for testing and review of the code being pushed to pywikibot-core. I want to write unit tests here and make sure the PR gets reviewed and accepted. This would be a good time to write a blog post about how catimages.py was ported, so that other scripts still residing in compat can refer to the methods I used. This is also a buffer week: in case one of the above weeks did not go according to plan, this is the week to fix the problem.
Measurable outcome: a fully working core bot script with functionality equivalent to the old compat bot - clearly indicating what functionality is missing and why [very ambitious; do we want a measurable outcome after a buffer week? It needs to be before week 8... - the blog post should also mention the differences/what is missing, and contain a kind of todo list; after this week the bot should be able to add the proper category tags, e.g. for uncategorized files]
Week 8 (July 11 - July 17)
Week 8 marks the beginning of the second half of GSoC and the second stage of this project too. Here, I would like to do some research on computer vision, probably recent literature on classification. It would be good to adopt some newer CV techniques with various pros/cons. We could bundle these into pypi-catimages and allow the user to choose which algorithm to use. [I'm not sure whether we should let the user choose or just use everything we have]
As mentioned earlier, this may not only be about the algorithm, but also about using other data sets like ILSVRC to improve classification accuracy.
Week 9 (July 18 - July 24)
In Week 9, I'll implement things based on the above analysis, and form a generic interface for using alternative algorithms or datasets in pypi-catimages.
Week 10 (July 25 - July 31)
In Week 10, it's once again time for a buffer week. I'd love to write a blog post at this time about the different algorithms I have researched in the past 2 weeks. Along with this, reviews, documentation, and tests will be needed to get this part merged.
Measurable outcome: ( propose a list of new "experimental" technologies - similar to T135993 - to look at, and TRY to include some of them... finish up documentation and tests for week 11 )
Week 11 (Aug 1 - Aug 7)
In Week 11, as I am not sure I will be as free as in the earlier weeks (see the section on other commitments), I'd like to improve the usability of the pywikibot interface to catimages.py - which does not require as much math. A few things that can be implemented are:
- Logging results appropriately for corner cases, so that after a bulk run the user can inspect all the results for verification. (DrTrigon mentioned this is already done to some extent, but it can be generalized and improved)
- Allowing the user to give hints on the category (maybe a list of probable categories), ignoring other categories even if they rank higher.
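The category-hint idea above can be sketched as a small pure function. The function name and the confidence-score scheme are invented placeholders, not the actual catimages.py output format:

```python
def filter_by_hints(scored_categories, hints=None):
    """Given {category: confidence} predictions, keep only the
    categories the user hinted at (when hints are given), and return
    them sorted by descending confidence. Sketch of the hint mechanism
    proposed above; the scoring scheme is hypothetical."""
    if hints:
        scored_categories = {c: s for c, s in scored_categories.items()
                             if c in hints}
    return sorted(scored_categories, key=scored_categories.get,
                  reverse=True)
```

For example, with hints=["Cats", "Dogs"], a higher-ranked "Cars" prediction would be dropped, matching the behaviour described in the second bullet.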
Measurable outcome: ( visit https://meta.wikimedia.org/wiki/WikiConference_India_2016 and have the code beta tested by say >=5 other users there in order to get them involved and receive feedback from them - use this to make the code installation and execution stable and reliable and easy to use )
Week 12 (Aug 8 - Aug 14)
Week 12 is meant to be a buffer week, in case something did not go as planned. This is to complete any backlogs, finish up unit tests, add more documentation, etc. One key thing I'd like to work on this week is documentation on how to set up catimages.py and how to use it. It would be good to provide more information for future contributors.
Measurable outcome: ( implement all the beta-testers bug fixes ) (milestone?)
Week 13 (Aug 15 - Aug 23)
Week 13 is also meant to be a buffer week. All pending documentation and tests will be completed during this time.
I’ll also be writing my final blog post and wikitech-l mail before the deadline sometime this week.
Measurable outcome: ( 1. have a stable and reliable pypi package "pypi-catimages", with its dependencies etc., which can be installed and run easily by any user. 2. Put effort into trying to add new "experimental" technologies (see week 9). 3. Give a list of the features this package has and a list of the features missing compared to the old catimages.py ) (milestone?)
- I'm familiar with tools and frameworks like Git, Android, OpenCV, GTK, Lucene/Solr, Django, Flask, and ROS
My current experience in wikimedia:
- Have set up the mediawiki core and pywikibot-core on my local machine.
- Familiarity with code and coding conventions in pywikibot.
- Worked on the micro tasks: T76211 related to catimages.py, T67192 related to pywikibot-core.
- Worked and gave input on other tasks: T123092, T115428, T124287, T73738, T67176 and general activity can be seen at @AbdealiJK
- Adept in python. Refactored pycolorname (gerrit -> github) to use classes and work with python 2 & 3.
- Good familiarity with continuous integration - set up circleci in pycolorname's new github repo.
- Good understanding of setuptools and pip - setup pypi deployment in pycolorname. Also set up automated nightly releases using rultor.
- Basic understanding of Computer Vision, Math and Data Analytics (through courses at my university).
- Made a few edits in wikipedia.
Other experience and projects in Opensource:
- I've participated in GSoC 2015 under GNOME/coala
- Attended the GUADEC 2015 conference in Sweden.
- Participated in India Hacks in the OpenSource track and attended the conference too.
- Contributed to various open source orgs to fix bugs which affect my work: validators, civicrm-buildkit, amp, autopep8
I am a fourth-year undergraduate at Indian Institute of Technology, Madras. I have been passionate about programming and development since high school, when I was introduced to the world of FLOSS, which made me interested in contributing. I have been involved in a few hackathons held at my college. My enthusiasm for developing FLOSS intensified after interacting with Richard Stallman, the founder of the Free Software Movement, who visited my college in 2014.
I started my journey in FLOSS by hacking on coala, and was introduced to the gnome community through it. I participated in GSoC with them and had an amazing experience with the folks there. I later went to the GUADEC conference and presented my GSoC 2015 work there. I then went on to participate in the opensource round of IndiaHacks, and will be going to that conference too.
I've been interested in the wikimedia community since a friend of mine (Kunal Grover) did his GSoC with wikimedia. Hence, I wish to participate and get to know the awesome community that created the website I have used most in my 4 years at university :)
I always try to stay logged into IRC (Channels: mediawiki, wikimedia-dev and pywikibot) during my working hours and will try to contribute back to the community as much as I can (code, documentation, and IRC). I am regular in replying to emails, hangout chats, phabricator, gerrit, and IRC (as long as my name is mentioned in the IRC chat). All source code written by me will be regularly published, reviewed and improved. I will keep my mentors updated about the progress through e-mails, IRC, or phabricator. All discussions regarding design and implementation will be public.
Other commitments (May 23 to August 23)
I am currently in my final semester and have 3 courses going on. My final thesis viva is tentatively around May 15 and I will have no course work after that.
Other than this, I have no plans for the summer. I propose to spend about 35-40 hours every week on this project. Towards the end of the GSoC period I may begin work for my job after university. The dates for this have not been fixed, and I will plan appropriately to not affect my GSoC in any way. Obviously, these times are subject to the project status, and extra time to meet deadlines will not be an issue from my side.
I will be active between 9:00am and 5:00pm, with an hour's break sometime in between. These working hours are compatible with the times my mentors are active.