
[GSoC 2016 Proposal] Port catimages.py to pywikibot-core
Closed, Resolved · Public

Description

Port catimages.py to pywikibot-core

Phabricator task containing the proposal : T129611: [GSoC 2016 Proposal] Port catimages.py to pywikibot-core

Personal Information

  • Name : Abdeali J Kothari
  • Email : abdealikothari@gmail.com
  • Github : https://github.com/AbdealiJK
  • Wikimedia accounts : Special:CentralAuth/AbdealiJK
  • IRC nick : AbdealiJK
  • Time Zone : UTC+5:30 (IST - India)
  • Typical working time : 9:00am to 6:00pm IST
  • Location : Chennai, India
  • School / Degree : B.Tech. Engineering Physics at Indian Institute of Technology, Madras. Expected to graduate in - August 2016

Abstract

The aim of the project is to bring to life the catimages.py script from pywikibot-compat. This involves heavy refactoring of the script. While doing this refactoring, it’d be useful to modularize the script and make it a generic package. The generic package can then be used in pywikibot-core to provide the same functionality as it provided earlier.

  • Possible Mentors : DrTrigon (@DrTrigon), John Vandenberg (@jayvdb)
  • Languages used : Mainly Python, with C/C++ for the OpenCV dependency. Possibly PHP in case we plan to make a MediaWiki extension
  • Related phabricator issue : The issue this proposal solves is T66838: Port catimages.py to core

Project Description

Motivation

To wikimedia: catimages.py automates the categorization of images. It's an invaluable tool to have, and it can be extremely accurate considering recent innovations in Computer Vision (CV). Using catimages.py we can give uploads more meaningful metadata to work with.

To pywikibot: With this project we can bring automated categorization, without manual intervention, back to pywikibot. Pywikibot already has the scripts imagerecat.py and checkimages.py. imagerecat doesn't work right now because it uses CommonSense from WikiSense, which is dead (T60869#1365653 and T78462), and checkimages requires manual input. As catimages attempts to categorize without any prior information (using just the file itself), if done right it would be easier to use.

To the rest of the world: Wikimedia is all about providing more data to everyone. One great outcome of moving catimages.py to pywikibot-core is that all the dependencies either become external PyPI packages or get moved upstream to other packages. This makes the tool usable by people who would not be able to install all of the dependencies manually! With the pycolorname package that I developed in the microtask, I produced a resource that is valuable for developers handling color; even non-developers can refer to the charts it generates.

Implementation

The project can be broadly divided into 3 parts,

  1. Dependency checking, updating the code to replace deprecated dependencies, and general refactoring.
    • Refactoring (use pip where possible)
    • Port to pywikibot-core
  2. Optimizing the script to work better and faster.
    • Improve the CV code by optimizing it (e.g. replace the algorithms used with better ones)
    • Add new features
  3. Tighten integration with commons interface (javascript templates or annotation tool with enhancements)

Refactoring
To explain the first part, we need to take a look at all the libraries and tools that catimages.py can use (refer to the table at T66838). As can be seen in the table, the following needs to be done to the dependencies:

  • Some packages are usable directly (eg: numpy, scipy) - minimal work required
  • Some need to be patched upstream (eg: music21, bob) - not much work, but may need time (3rd party maintainer involved)
  • Some need to be packaged and uploaded to pypi (eg: pycolorname, jseg)
  • Some need to be replaced as they are deprecated (eg: PIL, pyexiv2)
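A first pass over this dependency audit could be automated. The sketch below (module names are an illustrative subset from the table, not a definitive list) probes which dependencies are importable in the current environment:

```python
import importlib

# Illustrative subset of catimages.py dependencies; the real audit would
# cover the full table at T66838.
CANDIDATE_DEPS = ["numpy", "scipy", "PIL", "pyexiv2", "music21"]

def probe_dependencies(names):
    """Return {module_name: available?} by attempting each import."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status
```

Such a probe makes it easy to report, per machine, which features of the bot can run.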

Optimizing
The second part of optimizing the script catimages.py does not have a concrete game plan right now. This will be decided over the course of the project. I have worked on similar domains at my university. Possible algorithms that could be used:

  • R-CNN (Girshick et al., 2014) may be a good choice for faster machines.
  • SPP-net (Shaoqing Ren et al., 2015) is a faster version of R-CNN, but it takes longer to train. This may be beneficial in certain scenarios, though.
  • LC-KSVD is another algorithm I've used in the past; its results are not as strong, but it runs very fast.

One simple idea would be to provide a good interface to VLFeat - a library of Computer Vision algorithms specializing in image understanding. We can integrate VLFeat into the package and use the common algorithms directly. Although these algorithms are old (most were published in 2012 or earlier), they are useful and the library itself is flexible. This has the benefit that we will not have to maintain the algorithm-specific code ourselves; VLFeat, which was created to maintain a set of these algorithms, will probably be better at it.
This may not only be about the algorithm, but also about using other data sets. Specifically, I would like to try the ILSVRC (or ImageNet) dataset to improve classification accuracy. The benefit of this dataset is that it is an order of magnitude larger than Pascal (detailed comparison can be found here), which means more supervised training data.

Timeline

Community Bonding (April 22 - May 22)

The major aspects of the project that should be completed here are:

  1. Hack catimages.py (and deps) to be usable without pywikibot on a personal github repository. This is to get the basic functionality usable and testable.
  2. Understanding the math behind catimages.py. The algorithm and logic used in the script can be refactored to use newer methods from opencv, sklearn, and scipy, which are now optimized and more stable.
  3. Find methods to optimize catimages.py using new techniques.
NOTE: I've already begun on the first task at https://github.com/AbdealiJK/file-metadata

Week 1 (May 23 - May 29)

The aim in Week 1 is to handle all the modules which have patches. The reason to do this first is that it involves 3rd party developers, who may take time to reply if upstream changes are needed. This includes the modules: bob, jseg, music21, xbob-flandmark.
The first step is to find out why the patches were applied, identify whether they are still needed, and make the appropriate changes upstream or inside catimages.py. Interestingly, most of these patches are small changes which simply add paths to sys.path or specify arguments. Hence, this can be solved within a week.
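For illustration, the kind of sys.path patch described above typically looks like the sketch below (the directory layout is hypothetical). Seeing the pattern makes it clear why upstreaming or proper packaging removes the need for it:

```python
import os
import sys

def add_vendored_path(base_dir, subdir="externals"):
    """Prepend a vendored directory to sys.path, the way the old patches do.

    Proper packaging (a PyPI release with correct package metadata) makes
    this kind of hack unnecessary, which is the goal of this week's work.
    """
    path = os.path.join(base_dir, subdir)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path
```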

Week 2 (May 30 - June 5)

The aim for Week 2 is to replace unsupported packages and make pypi packages for packages which currently rely on archives (like .zip). Unsupported packages are: zbar and pyexiv2. Archives are: yaafelib, slic, jseg, and jseg/jpeg-6b.
We need to move these into GitHub repositories and deploy them to PyPI if needed, and then update catimages.py to use their new versions.
Measurable outcome: a first core bot script that can be run and tested

Week 3 (June 6 - June 12)

In week 3, it would be a good idea to pause and review the code written so far, fixing any minor issues that have come up. This is also a buffer week in case any work from earlier weeks is still pending. I’d also like to write a blog post about my experience so far.

Week 4 (June 13 - June 19)

In week 4, the aim is to begin revamping the OpenCV code. First off, the old cv module is deprecated and cv2 needs to be used instead. Also, the latest cv2 Python bindings are not backward compatible and the code needs to be updated accordingly. This would give me a good understanding of the Computer Vision part of catimages.py.

Week 5 - Midterm Evaluation (June 20 - June 26)

The opencv package is a little problematic. It has lots of custom C++/C code which needs to be refactored. Hence, I’ll be using my knowledge (from last week and community bonding) of sklearn and OpenCV to clean up the opencv package.
For the midterm evaluation I plan to have a PyPI package which can perform all the functions that catimages.py could. This would be independent of pywikibot-core, probably just a simple package. Let’s call this PyPI package pypi-catimages to avoid confusion.
Measurable outcome: Have a pypi package "pypi-catimages" which can perform a minimal set of the functions that catimages.py could: Detect faces using haarcascades, Categorize based on metadata, Barcode and QR Code detection, Graphics detection. (milestone?)
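The face-detection item above could look roughly like the sketch below. This is a minimal sketch, assuming OpenCV's cv2 bindings; recent opencv-python wheels expose the cascade files via cv2.data.haarcascades, while older installs need an explicit path. The exact API in pypi-catimages may differ:

```python
# Minimal face-detection sketch using OpenCV Haar cascades. cv2 is treated
# as optional so the rest of the package still imports without it.
try:
    import cv2
except ImportError:
    cv2 = None

def detect_faces(image_path):
    """Return a list of (x, y, w, h) face boxes, or None if unavailable."""
    if cv2 is None:
        return None
    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return None
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(v) for v in box) for box in faces]
```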

Week 6 (June 27 - July 3)

By Week 6 I plan to begin integrating the previously created PyPI package with pywikibot. This means using CLI arguments from pywikibot, using the page generators in core, and updating all pywikibot-related functionality.
It is not yet decided where this script will live. Possible approaches are:

  • pywikibot-core and pypi-catimages are requirements for another pypi package pywikibot-catimages for it to be used with mediawiki.
  • A script is created in pywikibot-core which uses the pypi-catimages and handles the args required for the script to be used with mediawiki.
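As a rough illustration of the second approach (all names here are hypothetical - pypi-catimages does not exist yet), the core-side script would mostly be glue between a page generator and the package:

```python
# Glue-layer sketch: pywikibot is treated as optional so the categorization
# logic stays testable without a wiki connection.
try:
    import pywikibot
except ImportError:
    pywikibot = None

def categorize_pages(pages, categorize):
    """Run a categorize(title) callable over pages that expose .title()."""
    results = {}
    for page in pages:
        results[page.title()] = categorize(page.title())
    return results
```

In the real script, `pages` would come from pywikibot's page generators and `categorize` from the pypi-catimages package.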

Measurable outcome: The outcome here would be a Pull Request / Patch Set which adds pywikibot interface functionality to pypi-catimages. [ok; I think we are quite close - did we miss or neglect something up to here that we have to finish work on?]

Week 7 (July 4 - July 10)

In Week 7 there is, again, a pause for testing and review of the code being pushed to pywikibot-core. I would want to write unit tests here and make sure the PR gets reviewed and accepted. This would be a good time to write a blog post about how catimages.py was ported, so that other scripts that still reside in compat can refer to the methods I used. This is also a buffer week: in case one of the above weeks did not go according to plan, this is the week to fix it.
Measurable outcome: fully working core bot script that has equivalent functionality to the old compat bot - clearly indicate what functionality is missing and why [very ambitious; do we want to have a measurable outcome after a buffer week? It needs to be before week 8... - the blog post should mention the differences/what is missing too, and contain a kind of todo list; after this week the bot should be able to add the proper category tags, e.g. for uncategorized files]

Week 8 (July 11 - July 17)

Week 8 marks the beginning of the second half of GSoC and the second stage of this project. Here, I would like to do some research on computer vision, probably recent literature on classification. It would be good to use some newer CV techniques with various pros/cons. We could bundle these into pypi-catimages and allow the user to choose which algorithm to use. [I'm not sure whether we should allow the user to choose or just use all we have]
As mentioned earlier, this may not only be about the idea, but also using other data sets like ILSVRC to improve classification accuracy.

Week 9 (July 18 - July 24)

In Week 9, I’ll implement things based on the above analysis and form a generic interface to use alternative algorithms or datasets in pypi-catimages.

Week 10 (July 25 - July 31)

In Week 10, once again it’s time for a buffer week. I’d love to write a blog post at this time about the different algorithms I have researched in the past 2 weeks. Along with this, reviews, documentation, and tests will be needed to get this part merged.
Measurable outcome: ( propose list with new "experimental" technologies - similar to T135993 - to look at and TRY to include some of them... finish up documentation and tests for week 11)

Week 11 (Aug 1 - Aug 7)

In Week 11, as I am not sure whether I will be as free as in the earlier weeks (see the section on other commitments), I’d like to improve the usability of the pywikibot interface to catimages.py - which does not require as much math. A few things that can be implemented are:

  • Logging results appropriately for corner cases, in case the user wants to check all the results after a bulk run for verification. (DrTrigon mentioned this is already done to some extent, but can be generalized and improved)
  • Allow the user to give some hints on the category (maybe a list of probable categories) and ignore other categories even if they rank higher than these.

Measurable outcome: ( visit https://meta.wikimedia.org/wiki/WikiConference_India_2016 and have the code beta tested by say >=5 other users there in order to get them involved and receive feedback from them - use this to make the code installation and execution stable and reliable and easy to use )

Week 12 (Aug 8 - Aug 14)

Week 12 is meant to be a buffer week, in case something did not go as planned. This is to complete any backlogs, finish up unit tests, add more documentation, etc. One key thing I’d like to work on in this week is documentation on how to set up catimages.py and how to use it. It would be good to provide more information for future contributors.
Measurable outcome: ( implement all the beta-testers bug fixes ) (milestone?)

Week 13 (Aug 15 - Aug 23)

Week 13 is also meant to be a buffer week. All pending documentation and tests will be completed during this time.
I’ll also be writing my final blog post and wikitech-I mail before the deadline sometime this week.
Measurable outcome: ( 1. have a stable and reliable PyPI package "pypi-catimages", dependencies, etc. which can be installed and run easily by any user. 2. Put effort into trying to add new "experimental" technologies (see week 9). 3. give a list of the features this package has and a list of the features missing compared to the old catimages.py ) (milestone?)

Extra time: In the event that I have extra time because of the project going exceedingly well (I can hope, can’t I? :D), I’d like to work on testing. I’ve seen posts on the mailing list about unit tests failing, and there are tracking issues about the unit tests and integration tests (T60941 and T72336). Testing is something that I’ve not done enough of, and I think it’s an avenue where I can learn and help the community simultaneously. Another thing I am interested in is Python 3 support.

Experience

Skills:

  • I'm familiar with programming languages like Python, Javascript, C, PHP, Java, Matlab (in order of proficiency)
  • I'm familiar with frameworks like Git, Android, OpenCV, GTK, Lucene/Solr, Django, Flask, and ROS

My current experience in wikimedia:

  • Have set up the mediawiki core and pywikibot-core on my local machine.
  • Familiarity with code and coding conventions in pywikibot.
  • Worked on the micro tasks: T76211 related to catimages.py, T67192 related to pywikibot-core.
  • Worked and gave input on other tasks: T123092, T115428, T124287, T73738, T67176 and general activity can be seen at @AbdealiJK
  • Adept in python. Refactored pycolorname (gerrit -> github) to use classes and work with python 2 & 3.
  • Good familiarity with continuous integration - set up circleci in pycolorname's new github repo.
  • Good understanding of setuptools and pip - setup pypi deployment in pycolorname. Also set up automated nightly releases using rultor.
  • Basic understanding of Computer Vision, Math and Data Analytics (through courses at my university).
  • Made a few edits in wikipedia.

Other experience and projects in Opensource:

About Me

I am a fourth-year undergraduate at the Indian Institute of Technology, Madras. I have been passionate about programming and development since high school, when I got introduced to the world of FLOSS, which made me interested in contributing. I have been involved in a few hackathons held at my college. My enthusiasm for developing FLOSS intensified after interacting with Richard Stallman, the founder of the Free Software Movement, who visited my college in 2014.

I started my journey in FLOSS by hacking on coala, and got introduced to the GNOME community through it. I participated in GSoC with them and had an amazing experience with the folks there. I later went to the GUADEC conference and presented my GSoC 2015 work there. I also went on to participate in the open-source round of IndiaHacks and will be going to that conference too.

I’ve been interested in the wikimedia community since a friend of mine (Kunal Grover) did his GSoC with wikimedia. Hence, I wish to participate and get to know the awesome community that created the website I used most in my 4 years at university :)

I always try to stay logged into IRC (Channels: mediawiki, wikimedia-dev and pywikibot) during my working hours and will try to contribute back to the community as much as I can (code, documentation, and IRC). I am regular in replying to emails, hangout chats, phabricator, gerrit, and IRC (as long as my name is mentioned in the IRC chat). All source code written by me will be regularly published, reviewed and improved. I will keep my mentors updated about the progress through e-mails, IRC, or phabricator. All discussions regarding design and implementation will be public.

Other commitments (May 23 to August 23)

I am currently in my final semester and have 3 courses going on. My final thesis viva is tentatively around May 15, and I will have no course work after that.
Other than this, I have no plans for the summer. I propose to spend about 35-40 hours every week on this project. Towards the end of the GSoC period I may begin work at my post-university job. The dates for this have not been fixed, and I will plan appropriately so that it does not affect my GSoC in any way. Obviously, these times are subject to the project status, and putting in extra time to meet deadlines will not be an issue from my side.
I will be active between 9:00am and 5:00pm, with an hour's break sometime in between. These working times are compatible with the times my mentors are active.


Event Timeline

jayvdb renamed this task from [GSoC 2016 Proposal] Port catimages.py -> pywikibot-core to [GSoC 2016 Proposal] Port catimages.py to pywikibot-core. (Mar 12 2016, 8:52 AM)
jayvdb updated the task description.

So first thanks for the effort to all of you! Special thanks to AbdealiJK!

I decided that I would like to get more involved in this, so it's fine to put me in - wherever you want. :)

About the proposal:

1.) I like the idea to think about including r-cnn, SPP-net, LCKSVD, VLFeat - do they have python bindings?

2.) As a name I would like to propose "metadata", since that's exactly what all the code is about: finding, reading and generating context-dependent metadata. What do you think? Is there already another software package of that name out there?

3.) Personally I would vote for this:

  • A script is created in pywikibot-core which uses the pypi-catimages and handles the args required for the script to be used with mediawiki.

since I am most used to this from the past. As I understand it, it's more straightforward with respect to deps and a stand-alone (qt) package

4.) A very nice idea would be to either define a subset of Wiki Commons files as a dataset (involves a lot of work - more of a future project) or to upload the used VOC datasets to Commons.

5.) You mention some things that can be implemented like:

Prompt the user for what category should be used
if the bot finds that it is not sure of the category.
i.e. No category has a considerably higher
probability than the rest.

You should not weigh categories against each other. Either the scores/indicators are high/strong enough to put a file into a category or not. This is more a question for a stand-alone app, which would just put it into a category like "uncategorized" or "needs manual intervention" based on the fact that the library returns no result.

Logging results appropriately for corner cases in
case the user wants to check results after a bulk
run and then see all the results later for
verification.

This is already implemented to some extent - have a look at my Commons pages. But it needs generalization and improvement for sure.

Allow user to give some hints on the category
(Maybe a list of probably categories) and ignore
other categories even if they rank higher than this
one.

Having a kind of filter or ruleset configuration that can be modified by the end user in a config file would be nice. Parts of such a structure are actually given by how the code is organized already. This could be moved to a separate config Python script that gets loaded and included at the beginning.
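To make the idea concrete, a user-editable ruleset module could be as simple as the sketch below (names and defaults are hypothetical):

```python
# Hypothetical user config: a plain Python module the bot loads at startup.
DEFAULT_RULES = {
    "whitelist": [],   # category hints from the user (empty = allow all)
    "blacklist": [],   # categories that must never be applied
    "min_score": 0.5,  # minimum detector confidence to apply a category
}

def apply_rules(candidates, rules):
    """Filter (category, score) pairs according to the user's ruleset."""
    kept = []
    for category, score in candidates:
        if category in rules["blacklist"]:
            continue
        if score < rules["min_score"]:
            continue
        if rules["whitelist"] and category not in rules["whitelist"]:
            continue
        kept.append((category, score))
    return kept
```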

6.) Unit tests: +100 ;))

7.) Kind of a personal wish, but I think it could be useful for the future (e.g. your blog) as well: could you write a short/quick step-by-step install and test howto? That would help me enormously in giving you support. I think it would help to get more feedback as well. And it is an essential part of the documentation for later anyways.

Thanks and Greetings

Thanks a lot for the feedback !

1.) I like the idea to think about including r-cnn, SPP-net, LCKSVD, VLFeat - do they have python bindings?

VLFeat has unofficial bindings; cyvlfeat is the most recent of them. If that doesn't turn out to be sufficient, we can use it via simple command-line utilities (wrapping it thinly using subprocess).
r-cnn and spp-net have MATLAB code, but that's fine because once we train the model, the hard part is done. They both use caffe, which works in python, so I can re-write the classification code in python. LCKSVD - not sure.
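The thin subprocess wrapper mentioned above could look like this sketch (the actual binary name, e.g. a VLFeat command-line tool, would be substituted in; "echo" below is just for demonstration):

```python
import shutil
import subprocess

def run_cli_tool(command, args, timeout=60):
    """Run an external CV tool and return its stdout, or None if missing."""
    # Gracefully degrade when the tool is not installed on this machine.
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command] + list(args),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```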


5.) You mention some things that can be implemented like:

Prompt the user for what category should be used
if the bot finds that it is not sure of the category.

I'll think this section over a bit. These were not very planned out, I agree.


7.) Kind of a personal wish but I think it could be useful for the future (e.g. your blog) as well: Could you write a short/quick step-by-step install and test howto?

Is this install and test howto for pypi-catimages during the GSoC ?
Or is this about the pycolorname package ?

r-cnn and spp-net have MATLAB code, but that's fine because once we train the model, the hard part is done. They both use caffe, which works in python, so I can re-write the classification code in python. LCKSVD - not sure.

I tend to disagree. We should avoid having code (even for training) that is not open-source IF possible! IF there is a suitable alternative to using matlab code, we should definitely go for that!
The reason for this is that I'm quite sure there is commercial CV code out there that can do what we need - but the point is to do it open-source and free.
Having said that, we can discuss how much to deviate from this policy... e.g. the training could be one such place... (but if possible we should avoid that)

Is this install and test howto for pypi-catimages during the GSoC ?
Or is this about the pycolorname package ?

I had 'pycolorname' in mind, but as soon as you need/want feedback from me (and others) about 'pypi-catimages', this would make everything much simpler, and we would all know that we are talking about the same stuff.

@DrTrigon I think there's been some mistake in communication. Those codes are open source. I didn't want to use matlab in the package (i.e. I didn't want matlab to be a dependency) - and hence mentioned I could write python code for the classification.

IMPORTANT: The deadline for submitting your proposal to the Google Summer of Code 2016 application system falls in roughly 24 hours, at Mar 25 2016, 19:00 UTC. Please make sure that you have a PDF copy of your proposal in the application system beforehand, to avoid last-minute confusion. Remember to link your Phabricator task and associate 2 mentors in the proposal description, so that it is easy to review. Past the deadline, you should only make changes limited to fixing typos or incorporating feedback. Good luck, and check out the micro-tasks!

Congratulations @AbdealiJK on getting selected for this project in GSoC 2016! Wish you good luck with it. You can start discussing ideas and get up to speed with the project, as the Community Bonding period has started.

Hi Sumit,

Thanks a lot ! I'll make the most out of this awesome opportunity :)

Welcome to Google Summer of Code 2016 and to the Community Bonding period! Happy to have you here; this is a crucial time to make important decisions regarding how your project should take shape during this two-month internship period. You can find information about the Community Bonding period in our "Life of a successful project" doc. To make sure everything goes as planned, please follow the instructions in T133647 and create the 'Community Bonding Evaluation for $project' task as a subtask of your proposal task. Please note that all further tasks you create for evaluation and GSoC organization purposes should be subtasks of your proposal, and not the parent task - let's reduce the notification count. In case you are stuck, feel free to comment below T133647 or open up a Conpherence thread with the mentors and org admins. You can find example tasks in the task description of T133647.

DrTrigon updated the task description.