
[GSoC requirement] Weekly Reports for Port to pywikibot-core
Closed, ResolvedPublic


Weekly reports of progress made for the GSoC project T129611.

Community Bonding (April 22 - May 22)


(Minutes and Agenda are mentioned in the task related to the Meeting)

Week 1 (May 23 - May 29)

  • Dependencies on Travis: I tested a lot of dependencies on Travis and got them installed there. I found that dlib would be easier to install with conda because pkgconfig.h wasn't found otherwise. Plus, conda has binaries for scipy, making it much faster than installing with pip. (Even cyvlfeat is easier with conda...)
  • Framework: I made a nice OOP framework, similar to how catimages used to work: a class for each file type, and a generic GenericFile class from which everything inherits. This is split into multiple files to keep it readable and modular. 45eae3
  • Exif data: Read EXIF data using exiftool, parsing its JSON output. b51228
  • Mimetype: Use magic for mimetype analysis with backups 1102cf
  • T135836: Meeting about Unidentified Faces
  • music21 and ffprobe: Implemented ffprobe and avprobe support. Now, audio files are completely handled - Duration, rate, streams info, etc. are available.
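
The mimetype step above ("magic with backups") can be sketched roughly as follows. This is a hedged illustration with hypothetical function names, not file-metadata's actual API: prefer content-based detection via the optional python-magic binding, and fall back to the stdlib's extension-based guess when libmagic is unavailable.

```python
import mimetypes

def guess_by_extension(path):
    """Backup: extension-based guess from the stdlib (no file access needed)."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime

def guess_mimetype(path):
    """Prefer content-based detection via python-magic; fall back to the
    extension-based guess if the libmagic binding is not installed."""
    try:
        import magic  # optional dependency (python-magic)
        return magic.from_file(path, mime=True)
    except ImportError:
        return guess_by_extension(path)
```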

Week 2 (May 30 - June 5)

  • Color Info - Used colormath and pycolorname to get the average color of the given image. e3f1bb
  • Facial landmark - Got dlib working but used conda to install it on Travis. It is also available on pip, but boost.python is additionally required. 82ab50
  • Barcode and QRCode - libzbar has not been updated since 2009 and libdmtx has not been updated since 2011. Found zxing which supports more types of barcodes than both zbar and dmtx put together, so I'm using this. It's also much easier to install as it just needs java, and I can auto download a .jar file to run it. ZXing is not active, but it's still being supported - couldn't find anything better out there. 465ea4
  • Bulk testing:
    • Tried ToolsLab, but it frequently hangs my SSH session, which is troublesome. Then I checked the dumps ToolsLab already has, but they don't have the images ...
    • Finally fell back on the option we discussed of using Travis. There, I got the system to work, but it sometimes timed out non-deterministically with an SSLError. Note that Travis is really fast, as expected; it can do 2000+ files within 5 minutes.
  • Results at
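
The "average color" analysis above can be sketched in pure Python. This is a minimal illustration with a tiny made-up palette rather than pycolorname's real data, and it compares colors by squared RGB distance for simplicity (the real code may well compare in a perceptual color space):

```python
def mean_color(pixels):
    """Average RGB over an iterable of (r, g, b) tuples."""
    n = 0
    totals = [0, 0, 0]
    for r, g, b in pixels:
        totals[0] += r
        totals[1] += g
        totals[2] += b
        n += 1
    return tuple(t / n for t in totals)

# Tiny illustrative palette -- the real project used pycolorname's data.
PALETTE = {'black': (0, 0, 0), 'white': (255, 255, 255), 'red': (255, 0, 0)}

def closest_name(color, palette=PALETTE):
    """Nearest palette entry by squared Euclidean distance in RGB space."""
    return min(palette,
               key=lambda name: sum((a - b) ** 2
                                    for a, b in zip(color, palette[name])))
```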

Week 3 (June 6 - June 12)

  • Dependencies - Added as much "auto-installation" as possible throughout the package:
    • PyPI packages - Various packages install automatically from PyPI where that's supported
    • Barcode analysis needs Java, so it fails if Java is not installed. The first time it's run, I download the .jar files into a folder and use them from there.
    • exiftool - I download the exiftool Perl script and run it using Perl. Hence, it fails if Perl is not installed.
    • ffmpeg - I found static (standalone) builds for Linux 32/64 bit. I use these if the platform is Linux; otherwise, installation fails if avprobe/ffprobe are not found.
    • Removed opencv - OpenCV is difficult to install and was the only Python library not on PyPI, so I decided to use scikit-image instead, which is easy to install and also supports Python 3. scikit-image can open many more image types, like animated GIF, TIFF, etc., and seems more useful. It was easy to replace the current usage of OpenCV with skimage, and I can add OpenCV back (as an optional dependency) if I ever find it has something skimage does not.
  • File conversions - There were cases where zxing doesn't support certain files, SVG needed to be converted to PNG, etc. Handled those issues:
    • zxing - If the image is a JPEG encoded with CMYK, write a tempfile with RGB data and use that instead. c8d60a
    • XCF - Use ImageMagick to convert XCF files. Earlier, convert was invoked via subprocess, but wand seems better as it's the official Python binding. b1424b
    • SVG - Convert SVG to PNG and then run the normal raster analysis. Use wand and ImageMagick again to avoid additional dependencies. 4e252c
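
The "auto download a .jar on first run" step described above could look roughly like this (an illustrative sketch; the function name and cache layout are made up, not file-metadata's actual code). The download only happens when the cached copy is missing:

```python
import os
import urllib.request

def ensure_jar(url, cache_dir, filename='zxing.jar'):
    """Return the path to a cached .jar, downloading it on first use only.
    (Hypothetical helper; not file-metadata's actual API.)"""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        # First run: fetch the binary and cache it for future runs.
        urllib.request.urlretrieve(url, path)
    return path
```

On subsequent runs the function is a cheap path lookup, so the Java tool can be invoked without any install step beyond having Java itself.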

Week 4 (June 13 - June 19)

  • Make a simple bot and readme - User:AbdealiJK/file-metadata and gist:a94fc0
  • Draft email to commons-l - User:AbdealiJK/file-metadata/Email
  • Upstream
    • dlib - Created the fix for dlib/136
    • matplotlib - Created PR to suggest installation step matplotlib/6575
    • skimage - Found unusual bug in skimage where the file reading is not giving the expected output - scikit-image/2154
  • Bug fixes -
    • Unicode encodings - There was an issue in the decoding from C libraries pointed out by Zhuyifei c755e7
    • Dict issue - The dict did not handle keys which had a value before and get changed to None. Pointed out by Zhuyifei e11dc9
    • Large decompressed files - Pillow threw a warning for files that had too many pixels. We now warn and ignore this gracefully. Fixed in 19382d
    • zxing small images - ZXing has an issue with very small images because the first 3 pixel locations were hardcoded. Ignore small images for zxing. Fixed in 1c0de6. Fixed upstream too zxing/607.
    • zxing unsupported image type - ZXing does not support CMYK files. In some cases, exiftool is unsure if a file is RGB or CMYK. Here, just assume it's cmyk and convert it. Fixed in 7fc106
  • New features
    • zbar support - Added zbar support 9a8e62. zbar detects barcodes that zxing does not, especially vertical barcodes, though it seems to have more false positives too. zbar also has some performance issues: it is considerably slower than zxing (3 times slower on my computer) and considerably more memory intensive (my system hangs when running the zbar tests).
    • Created with - Analyze the EXIF data and provide a simpler analysis routine by checking for traces of different software. This routine has a curated list of software, each with a specific key associated with it. 3eb80c
  • Bulk tests - I got the bot running on the ToolsLab server (Zhuyifei wanted to test it there and I thought it'd be a good idea to get it running there). I also got it to run in bulk using ToolsLab (because Travis' time limit was getting annoying). Made the logs at ...logs/Category_Images_from_the_State_Library_of_Queensland and .../logs/Category_JPEG_files
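
The "Created with" routine above amounts to a curated lookup over EXIF traces. A hedged sketch follows; the key names and marker strings here are invented for illustration and are not the project's actual curated list:

```python
# Curated mapping: software name -> (EXIF key, substring that betrays it).
# Entries are illustrative only, not file-metadata's real list.
SOFTWARE_TRACES = {
    'GIMP': ('Software', 'GIMP'),
    'Adobe Photoshop': ('CreatorTool', 'Photoshop'),
    'Picasa': ('Software', 'Picasa'),
}

def created_with(exif):
    """Return the names of software whose traces appear in the exif dict."""
    found = []
    for name, (key, marker) in SOFTWARE_TRACES.items():
        if marker.lower() in str(exif.get(key, '')).lower():
            found.append(name)
    return found
```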

Week 5 - Midterm Evaluation (June 20 - June 26)

  • Upstream:
    • installation using wheels - I verified that wheels are used when installing all the dependencies except dlib. (This was tested on ToolsLab, on my local computer, and on Travis.) There may be cases where wheels aren't used, but on common (manylinux?) systems they are.
    • dlib - Make wheels rather than compiling dlib/138
  • New features:
    • Haarcascades - Made haarcascades work bc55f9.
    • Line detection - Made features for line detection to work 723a23
  • Testing:
    • Bulk tests with zbar, zxing, haarcascades, dlib can be found for JPEG_files and SLQ.
    • Test on files from Category:Faces - Ran on the subset Male Faces
    • Test on files from Category:Line_drawings - Ran on Line drawings
  • Bug fixes:
    • Disk space clean up - Added a close() function as well as __enter__ and __exit__ methods to delete the files created temporarily during analysis (example: XCF -> PNG, etc.) 6bbca2
    • Memory leak - Fixed a memory leak because of which running the bot on more than ~1500 files kept giving memory errors on ToolsLab (which has 16GB RAM). bcb071
    • TIFF files in zxing - ZXing cannot handle TIFF files and they were being ignored. Now, we convert it to PNG and use the PNG with zxing to detect barcodes. 4ec26a
  • Misc
    • Wikitech - Made page on wikitech about how to install the bot there wikitech:User:AbdealiJK/file-metadata.
    • Mention Categories - Made the bot script mention which categories the file should belong to []()
    • Dependencies - I've removed some dependencies that were very trivial, in favour of implementing the functions myself in file-metadata. Those dependencies did not have apt-get packages, so if we ever want to make a .deb package of file-metadata, this makes it easier.
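
The disk-space cleanup above (close() plus __enter__/__exit__) can be sketched as a small context manager. Hypothetical, simplified class; file-metadata's real GenericFile does much more:

```python
import os

class GenericFile:
    """Sketch of the temp-file cleanup: analysis steps register temporary
    files (e.g. an XCF converted to PNG) in self.temps, and close() removes
    them. Names are illustrative, not the project's exact implementation."""

    def __init__(self, path):
        self.path = path
        self.temps = []  # paths of temporary files created during analysis

    def close(self):
        for temp in self.temps:
            if os.path.exists(temp):
                os.remove(temp)
        self.temps = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()  # clean up even if the analysis raised
```

Using the class in a with-block guarantees the temporary conversions are deleted however the analysis exits.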

Week 6 (June 27 - July 3)

  • Bulk tests
  • Upstream
    • dlib - I began working on this and found it to be a major time sink. I needed to understand wheels and manylinux to be able to make it. I got Matthew Brett, who has made many wheels for numpy, scipy, matplotlib, etc., interested in this, so hopefully he can get the wheel part done and I can help out where necessary. dlib/138
    • pillow - Ensured that the Pillow version 3.3.0 due on July 1 will have wheels :) Pillow/1859 and Pillow/1990 and pillow-wheels/36
  • Fixes/Features
    • Dependencies - Better way to handle dependencies -
      • Created a class for each dependency which checks whether it is installed and so on. c85dd4
      • This class provides a better error message if something is not found. Fixes file-metadata/46, c85dd4
      • Removed "auto download" of exiftool and ffmpeg, as downloading binaries should preferably be left to package managers. The new error messages ensure they are installed correctly during pip install. 035163
      • Using this, we now have a class for zxing too, which downloads the binary during pip install. (ZXing does not have an apt-get, yum, etc. package. It's only on Maven, and we don't want users to install Maven ...) a19904
    • Vehicle detection - Tried vehicle_detection_haarcascades, which provides a cascade for cars. The code is in a separate branch 6e7c19 for testing. The cascade is not very good; it can only detect small cars. If I give it something like File:MERCEDESBENZ600Coupe-C100--2118_3.jpg it detects only false positives around the car ... In File:Södertäljevägen_vinter_2013a_01.jpg it was able to detect 2 cars correctly. So it seems this cascade is meant for videos, where there is blurring and the car is moving. The training set doesn't seem suited to static, good-quality images where the car is the main subject.
    • Face detection using Exif data
      • Found images which have exif data from the camera for most of the makes.
      • Implemented code b31a7c which handles Fujifilm, Sony, Nikon, and Panasonic cameras. I found that the faces were incorrectly detected for all except the Sony image.
      • Verified that my code gives the same results as the original Perl script. It seems the original script was not quite correct in its decoding of the face data, or the protocols have changed since... I tried correcting it but couldn't figure out how to encode the data.
      • Conclusion: I think this is not worth it, and pretty difficult to maintain. Probably cameras will come to a standardized way to save this (EXIF may add it to their specifications?) as it gets more popular. We could think about doing it then ...
    • RGBA handling - Earlier, we used to simply strip the alpha channel out. This is incorrect, as some logo PNG files use alpha to darken or lighten a color. I came across this SO question, and now I assume RGBA images are on a white background and convert them to RGB using the appropriate formula. This makes RGBA handling much more sane and intuitive. c20eca Also created an upstream issue about this scikit-image/2174.
    • Script
      • Made a CLI script which is the "official bot demo script" as it resides directly in the github repo - no need to download a gist and so on. The console script is called wikibot-filemeta-simple b6921c
      • Mention Categories - Made an argument -showcats which shows the categories based on the analysis rather than the detailed results of the analysis. 9bafa1
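
The RGBA-over-white conversion above reduces to standard alpha compositing against a white background: per channel, out = alpha * c + (1 - alpha) * 255, with alpha normalised to [0, 1]. A per-pixel sketch (not the exact file-metadata code, which works on whole arrays):

```python
def rgba_to_rgb_on_white(r, g, b, a):
    """Composite one RGBA pixel over a white background.
    alpha is the a channel normalised from [0, 255] to [0, 1]."""
    alpha = a / 255.0
    return tuple(round(alpha * c + (1 - alpha) * 255) for c in (r, g, b))
```

A fully opaque pixel is unchanged, while a fully transparent one becomes pure white, which matches how most renderers display such logos.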

Week 7 (July 4 - July 10)

  • Bulk tests
  • Upstream
  • Fixes/Features
    • Software - Detect whether the software used was GIMP (for PNG/JPEG), gnome-screenshot, Adobe Photoshop photomerge (panorama), or Paint.NET b950aa
    • Camera/Scanner - Detect the camera model (from EXIF) and suggest appropriate "Taken with <Camera model>" or "Scanned with <Scanner model>" categories. Sadly, there is no nice way to distinguish between a camera and a scanner from EXIF data, so I just check whether the category exists using pywikibot. Because of this, if neither category exists, we suggest "Taken or Created with ..." and let a human figure it out. I tried using EXIF:FileSource, which has values like "Digital Camera", "Reflective Printing Scanner", and so on, to tell a camera from a scanner - but it was rarely present in the images I found, and I wasn't able to check whether it was reliable. 207de4/tests/
    • Docker - I wasted quite a lot of time on docker with no good result. It seemed easy, but I found way too many issues to repetitively test things:
      • Had issues like GPG key became invalid (docker-brew-ubuntu-core/52)
      • Minor modifications became troublesome because it had to redownload things (My internet is currently mediocre)
      • It took a lot of time to install everything / build everything (although my computer is decently fast)
      • I then tried focusing on CentOS as Ubuntu became difficult to cope with, but it has its own issues (boost not being detected, zbar not found in EPEL7 though it is in EPEL6, etc.)
      • I then tried using Travis because my system seemed to be the bottleneck. The Travis builds are slow because docker needs dist:trusty, which doesn't have container support yet. Sometimes I had to wait nearly an hour for the build to start!
    • Geolocation - I was able to get the latitude and longitude from EXIF, and then I had to use this to fetch the city/state/country at that lat/lon.
      • I tried geopy which is a wrapper for various Map APIs to find the address based on lat/lon. I initially tried OpenStreetMaps and so on, but found GoogleV3 API was the nicest and provided structured address info (i.e. it told what the locality and country was separately). When doing this, the issue is that Google only allows 1500 free API calls.
      • I then tried the demo countries approach, which does this using shape files of the country borders. This requires some heavy dependencies like osgeo+GDAL for the mathematical shape computations. GDAL is not easy to install inside a virtualenv ... so I've left it for now.
      • Currently looking into using Nominatim from OSM directly, as it does provide structured data (geopy was not fetching the structured data)
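
Querying Nominatim directly needs nothing but the stdlib. A sketch follows; the endpoint and parameters follow Nominatim's public reverse-geocoding API, but the User-Agent string is a placeholder, and any real use has to respect the rate limits discussed above:

```python
import json
import urllib.parse
import urllib.request

NOMINATIM = 'https://nominatim.openstreetmap.org/reverse'

def build_reverse_url(lat, lon):
    """Build the reverse-geocoding URL, asking for structured JSON output."""
    params = {'format': 'json', 'lat': lat, 'lon': lon, 'addressdetails': 1}
    return NOMINATIM + '?' + urllib.parse.urlencode(params)

def reverse_geocode(lat, lon):
    """Fetch the structured address (city/state/country) for a lat/lon.
    Nominatim requires an identifying User-Agent and very low query rates."""
    req = urllib.request.Request(
        build_reverse_url(lat, lon),
        headers={'User-Agent': 'file-metadata-sketch'})  # placeholder UA
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get('address', {})
```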

Week 8 (July 11 - July 17)

  • Bulk tests
    • Ran JPEG_files with the new bulk script which shows stats and so on too.
    • Ran -newfiles with the new bulk script which shows stats.
      • Newfiles is quite unusual, because it seems the majority of the files are very large (frequently 300 MB TIFF files). This makes downloading and analysis very slow.
      • There are race conditions in newfiles. I do "if page.exists(): download_page(page)", and the page may be deleted between the exists() check and the download. This caused pywikibot's NoPage exception twice.
      • Some files in the analysis will appear as [[:]] because they have been deleted after being analyzed, and a bot removes all references to those files. For example File:Cyprien.jpg.
  • Upstream
    • matplotlib - The installation help message for missing dependencies got merged for matplotlib matplotlib/6575.
  • Fixes/Features
    • Scripts:
      • Made the bulk script an "official script" (i.e. it's packaged with pip) so users can run that themselves. It's mainly meant for statistics of file-meta like an integration test. (NOTE: This has not yet been pushed to master though)
      • Added some stats to the bulk script like histogram of number of categories found per file
      • Added the -showinfo:XYZ option with args "cats", "info", "all" - letting the user decide whether to show all the detailed info, only the predicted categories, or both from the analysis.
      • Added a -limitsize:XYZ argument which sets the maximum file size in megabytes. This helps avoid wasting time and bandwidth downloading large files.
    • zxing - Show an error when zxing is unable to open a file, as this is a problem with the file. This is done with logging.error()
    • GenericFile.create() - Handle djvu files correctly; some were being wrongly detected as images and giving errors when skimage tried to open them.
    • Nominatim - I tried using Nominatim, and it worked fine. Currently I'm querying directly without any Python wrapper, as it's simple enough. The issue is that nominatim.openstreetmap is not meant for bulk geocoding, as mentioned in its TnC, and they only allow a maximum of 1 query per minute to ensure that nobody abuses it.
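
The exists()/download race from the bulk tests above can only really be handled by catching the failure from the download itself, since the page can vanish between the check and the fetch. A sketch using a stand-in exception class (pywikibot's real exception is NoPage; the helper names are hypothetical):

```python
class PageGoneError(Exception):
    """Stand-in for pywikibot's NoPage exception."""

def download_if_exists(page, download):
    """exists() then download() is racy: the page can be deleted in between.
    So also catch the 'page gone' error from the download and skip gracefully.
    Returns True if the file was downloaded, False if it was gone."""
    if not page.exists():
        return False
    try:
        download(page)
    except PageGoneError:
        return False  # deleted between the check and the fetch
    return True
```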

Week 9 (July 18 - July 24)

  • Bulk tests
    • The template "Bar charts" had a limit of 25 rows. There are people attempting to make it dynamic on Wikipedia, but it's not done yet. I've temporarily made User:AbdealiJK/Templates/Bar_chart, which can handle 100 rows.
  • Upstream
  • Fixes/Features
    • In bulkbot - Add the -dry argument to print the output to the terminal if needed.
    • Monochrome - Added black-and-white detection based on mean-square-error thresholding over the color of every pixel. bulk_bot now adds this category to the images. Sepia is not as easy, so I've left it for the time being.
    • Unidentified People - Add "Category:Unidentified People" to all files where faces are detected (or a leaf of it, as applicable by location detection; the country level is always added even if it's a redlink)
    • Groups of People - Add "Category:Groups of People" when more than 3 faces are detected in a file (or a leaf of it, as applicable by location detection; the country level is always added even if it's a redlink)
    • Transparency - Add "Category:Transparent Background" if the image has an alpha channel and also the alpha channel is less than 255 at some point (i.e. it's not always opaque even though it has an alpha channel)
    • dlib faces - Added a threshold on score for dlib detected faces (0.045)
    • Football kits - Added subcategories of Category:Football kit templates based on file size as it is very standard.
    • Chemical compounds
      • If the image was created with chemtool, add it to Category:Chemical compounds as chemtool is only used to make compound structures.
      • Earlier we discussed that if there are only C, H, N, and O in text of the SVG it probably is a chemical compound. But it seems the tools users are using to make these SVG aren't using the <text> tag in SVG, rather they use a <g> tag to make a path (elliptical path for O, etc). This makes it difficult to detect the compounds (I checked about 50 images randomly sampled).
    • Locations -
      • Add "Template:GPS EXIF" when GPS coordinates are found.
      • Also, if the location is found, use it to add Category:<City> or Category:<State> or Category:<Country> to the image (the smaller region is preferred). This is because sometimes the city name isn't valid, and complicated cases can arise - for example, name clashes like "Category:Punjab, India" vs "Category:Punjab, Pakistan" - in which case it's probably better for a human to check it out.
      • Also try to use <City>, <State> (example: New Haven, Connecticut) or <City>, <Country> or <State>, <Country> (example: Punjab, India) if they exist.
    • Glasses - Tried to detect sunglasses / glasses on a face, but found that the haarcascade frequently detects normal eyes without glasses too. Also, the detection of glasses/eyes is fairly inaccurate, and we normally need to rely on the eye detection code. I've run it on more than 300 files now and it has never produced a detection, not even a false positive ... I don't think it is really useful.
    • Color calibration bars - This is in regard to the color bar jayvdb pointed out in Category:Robert_N._Dennis_collection_of_stereoscopic_views. It seems this strip is not specific to stereoscopic views. It is rather a scanning technique for color calibration. I could not find a category which contains all images with this so I'm currently adding "Category:Scans with IT8 target" to these images. I've tried 3-4 different possible algorithms and found that they had a pretty high error rate:
      • Find intensity profile over the top/bottom of the image and find number of jumps. If the jumps are ~20 it would be a strip. Also, have appropriate conditions to check if the
      • Try Dynamic Time Warping to detect whether the function is similar to floor(x) - but this does not incorporate the intensity (Y) error which varies a lot from image to image
      • Ensure that the bar is black and white to make the other methods robust - but there is actually a lot of color variation in the bar when doing this.
      • Attempt to find segments using SLIC, Quickshift, and Felzenszwalb's algorithm and then use these to compute the average color. But these segmentation techniques over-segment the image, so no easy data can be extracted from them.
      • I also attempted basic mean and median filtering to reduce error.
    • Stereo cards - Stereo cards are another thing we could detect. They were popular and hence are frequent in museum collections. A stereo card has 2 very similar images next to each other.
      • Tried simple image similarity routines, but most of them are pixel-by-pixel comparisons
      • Attempted to use something similar to Spatial Pyramid Pooling which pixelizes the image and then does pixel by pixel Mean Square Error calculation
      • Tried Histogram based image recognition as similar images would have similar histograms, but here again the MSE of the histograms is not distinguishable from normal images
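
The black-and-white detection from the "Monochrome" bullet can be sketched as a per-pixel channel-variance check averaged over the image: a grey pixel has r == g == b, so the mean squared deviation of the channels from their own mean is near zero. The threshold value below is illustrative, not the one used in bulk_bot:

```python
def is_monochrome(pixels, threshold=100.0):
    """Return True if the image looks black-and-white: average, over all
    pixels, the squared deviation of each channel from that pixel's mean,
    and compare against a threshold (illustrative value)."""
    total = 0.0
    n = 0
    for r, g, b in pixels:
        mean = (r + g + b) / 3.0
        total += ((r - mean) ** 2 + (g - mean) ** 2 + (b - mean) ** 2) / 3.0
        n += 1
    return (total / n) < threshold
```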

Week 10 (July 25 - July 31)

  • Testing
  • New statistics
  • Script
    • Added the -skip:X argument, which runs only 1 in every (X+1) files, to analyze an evenly sampled percentage of a generator.
    • Used the table template directly and made a new Template:Bartable on Commons (copied from Wikipedia), which helps make large bar charts using tables easily.
    • Fixed various UnicodeDecodeError exceptions and race conditions which halted the script on some files.
    • Handle templates more elegantly using the "tlx" template
  • Faces
    • If a face takes up the majority of the image (55%), consider it to be a portrait
  • Camera models
    • New camera models were discovered which were not being supported correctly. This fixed some models like Moto G phones, Fujifilm, Pentax, Casio, Sony, and Nokia Lumia phones. But a lot more are remaining, which can be done later.
  • Created with ...
    • Added version information to the templates for some of the softwares: LibreOffice, GIMP, Stella, Gnu Plot, Paint.NET, and Picasa.
    • Better handling of sub softwares of Photoshop (All CS versions detected now) and LibreOffice (Detects sub-tool names like Impress, Calc, etc)
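
The -skip:X behaviour described above amounts to a modulo filter over the generator. A sketch with a hypothetical helper name:

```python
def skip_sample(generator, skip):
    """Yield 1 out of every (skip + 1) items, evenly spaced over the
    generator, so large categories can be sampled cheaply."""
    for i, item in enumerate(generator):
        if i % (skip + 1) == 0:
            yield item
```

Because the filter is lazy, the remaining items are never downloaded or analyzed, which is the point of the option.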

Week 11 (Aug 1 - Aug 7)

  • Fixed various bugs when tests were run on various files.
  • Got green builds on docker-file-metadata with DrTrigon.
  • Tested file-metadata on various docker images for reliability.
  • Released v0.2.0 on pypi for others to test.
  • Went to Chandigarh to participate in WikiConference India 2016.

Week 12 (Aug 8 - Aug 14)

  • Ongoing participation in WikiConference India 2016.
  • Performing bot runs and making graphs using wikicode for the final report.
  • Fixing minor bugs and corner cases in general for the script.

Week 13 (Aug 15 - Aug 23)

  • Completed the final report.
  • Implemented minor tweaks to the Category:Graphics category, which was underperforming compared to the 10% threshold set for the MVP.
  • Completed the GSoC submission

Event Timeline

AbdealiJK renamed this task from Weekly Reports for Port to pywikibot-core to [GSoC requirement] Weekly Reports for Port to pywikibot-core.May 9 2016, 6:47 AM
AbdealiJK updated the task description.
DrTrigon updated the task description.

@AbdealiJK : Please add in your status report for the previous week.

@01tonythomas Thank you for the reminder. I thought I had updated it here, but I was mistaken. Updated the report for last week now.

Thank you for the weekly reports. Feel free to close this down, as the program just got over.