This proposal didn't get anyone willing to mentor it, so I abandoned it.
Name: Can I not disclose this publicly? I will gladly give my name to the WMF; I just don't want it to be too public.
IRC nickname on Freenode: Ostrzyciel
User page: https://www.mediawiki.org/wiki/User:Ostrzyciel
Typical working hours: 10AM to 10PM CET (UTC+1) / CEST (UTC+2)
I have not yet looked for mentors for this project. I will also need some help from people who know the API modules and Parsoid well. I am not sure whether what I wrote below is sensible, or how much of it I should take on :)
The following comes from one year of being a sysadmin on a medium-sized wiki (https://nonsa.pl/) that uses a lot of images from Wikimedia Commons. I have shared some of my observations with other sysadmins who maintain wikis that also rely heavily on InstantCommons (mostly Uncyclopedia), and they have confirmed that these problems occur for them too.
Please note that these problems mostly don't concern Wikimedia, as it uses a completely different solution for accessing Commons. Unfortunately, the rest of the world is not Wikimedia and has to use InstantCommons.
InstantCommons is slooow
For a long time I tried to find out why some pages took 100 ms to parse, while others could take minutes before they were finally parsed. As it turned out, the decisive factor was the number of images from Commons present on the page. For every page parse, the parser sequentially (!) requests information about every image, blocking further parsing. Given that each request takes on the order of hundreds of milliseconds, parsing a page can get really slow, very quickly.
This is a serious problem for the wiki I maintain, as we've decided to use as much free-licensed content with proper licensing information as possible, and a big part of that plan is using Commons a lot. We have real cases of pages that contain 50+ images from Commons (like this one: https://nonsa.pl/wiki/Kodeks_drogowy) and they regularly cause 504 gateway timeouts.
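To give a feel for the difference batching makes: the MediaWiki Action API already accepts multiple titles per `imageinfo` query (joined with `|`, up to 50 for ordinary clients), so a page with 120 Commons images could in principle be served by 3 requests instead of 120 serial ones. The sketch below only builds the batched query parameters; it is my illustration, not existing MediaWiki code.

```python
# Sketch: grouping file titles into batched action=query imageinfo
# requests instead of one request per image. The batching logic is mine;
# the API parameters (prop=imageinfo, iiprop, iiurlwidth, titles) are
# the standard MediaWiki Action API ones.

API_BATCH_LIMIT = 50  # max titles per query for non-bot API clients

def build_imageinfo_queries(file_titles, thumb_width=220):
    """Group file titles into batched imageinfo query parameter sets."""
    queries = []
    for i in range(0, len(file_titles), API_BATCH_LIMIT):
        batch = file_titles[i:i + API_BATCH_LIMIT]
        queries.append({
            "action": "query",
            "format": "json",
            "prop": "imageinfo",
            "iiprop": "url|size|mime",
            "iiurlwidth": thumb_width,
            "titles": "|".join(batch),
        })
    return queries

titles = [f"File:Example_{n}.jpg" for n in range(120)]
print(len(build_imageinfo_queries(titles)))  # 3 requests instead of 120
```

With round trips to Commons costing hundreds of milliseconds each, collapsing 120 requests into 3 is the difference between a timeout and a sub-second parse.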
Commons brings down entire wikis with it
If Wikimedia Commons is unavailable for some reason, our wiki becomes completely unusable. Every time a page is parsed, it has to wait sequentially for each image's information request to time out, which causes the page parse to time out and halts the entire wiki. This… is more frequent than you'd think: it happens at least once a week, and I estimate Commons accounts for about 90% of our unplanned downtime.
This is bad. I'm not asking for Commons to become more reliable (hey, it's free, right?), but remote wikis need a better way of dealing with these outages, as well as with outages of any other foreign repos they might be using.
Here I suggest a few solutions, i.e. tasks that I could do in the project to alleviate the above problems. I've listed them roughly from most to least sensible/possible/useful.
Batch imageinfo requesting in Parsoid
There isn't really a point in optimizing the current parser, as it's going away anyway, so I propose changing the image rendering code in Parsoid to defer all image information requests to the last, postprocessing stage of the pipeline, somewhere around redlink checking. I think this has already been proposed (comments in the code suggest so), but never implemented. Some of the necessary code (like the ability to make batch requests) is already in place, so this probably wouldn't require deep changes to Parsoid.
As far as I understand Parsoid's nature, rendering images is context-free and does not depend on anything that is not contained within the [[File: (...)]] tag (after expansion, of course), so that should be possible and not break anything. Of course, everything would have to be tested thoroughly.
This would help drastically reduce parse times for pages with a lot of images in foreign repos.
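The idea can be sketched as a two-phase pipeline: the parse phase only emits placeholders and records which files it needs, and a postprocessing phase fills them all in with a single batched lookup. This is purely my illustration of the shape of the change; the function and field names are mine, not Parsoid's actual internals.

```python
# Hypothetical two-phase sketch (names are mine, not Parsoid's):
# phase 1 renders placeholders without touching the network; phase 2
# resolves all of them at once, like redlink checking does for links.

def parse(image_titles):
    """Phase 1: build image nodes, only recording which files are needed."""
    needed, nodes = [], []
    for title in image_titles:
        needed.append(title)
        nodes.append({"type": "image", "title": title, "info": None})
    return nodes, needed

def batch_lookup(titles):
    """Stand-in for a single batched imageinfo request to the repo."""
    return {t: {"width": 220, "url": f"https://example.invalid/{t}"}
            for t in titles}

def postprocess(nodes, needed):
    """Phase 2: fill in image info for every placeholder in one pass."""
    info = batch_lookup(needed)
    for node in nodes:
        node["info"] = info[node["title"]]
    return nodes

nodes, needed = parse(["File:A.jpg", "File:B.jpg"])
nodes = postprocess(nodes, needed)
```

Because (as far as I understand) image rendering is context-free, deferring the lookups like this should not change the rendered output, only when the network requests happen.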
This would also help reduce the load on Commons' public API a bit, in the long term, so Wikimedia might have a small private interest in this :)
Edit: I forgot about multiple-density thumbnails; requesting them is still serial, so they would slow the process down anyway. Currently there is no way around this; fixing it requires rewriting an API module, which I describe below.
Rewrite foreignrepo API
See this task for more information: T89971
This is required to make the above fix really work (I think). Currently, thumbnails of different sizes have to be requested in separate requests due to constraints in the imageinfo API. The proposed feature set for such an API module is described quite well in the above and related tasks, so this shouldn't be too hard, just time-consuming.
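To make the constraint concrete: today one request yields thumbnail info for one width (via `iiurlwidth`), so the 1x/1.5x/2x density variants of a single image cost three round trips. A rewritten module could return every requested size in one response. The response shape below is my own sketch, not the design from T89971.

```python
# Hypothetical response shape for a rewritten foreign-repo API module
# (my sketch, not the actual T89971 design): all requested thumbnail
# widths for a file come back in a single response, instead of one
# width per request as imageinfo allows today.

def fake_foreignrepo_response(title, widths):
    """Simulate one response carrying every requested thumbnail size."""
    return {
        "title": title,
        "original": {"width": 1024, "height": 768},
        "thumbs": [
            {"width": w, "url": f"https://example.invalid/thumb/{w}px-{title}"}
            for w in widths
        ],
    }

# 1x, 1.5x and 2x density variants of a 220px thumb, one round trip:
resp = fake_foreignrepo_response("Example.jpg", [220, 330, 440])
```

Combined with batching by title, this would let Parsoid fetch everything a page needs, at every density, in a constant number of requests.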
Better ForeignAPIRepo client-side caching (somehow)
This is admittedly rather vague, but "better" here means any form of sensible caching, as all caching is currently disabled by default in InstantCommons (see: T235551). And when one forces caching on, it turns out to be rather… simplistic and unable to handle many cases well.
This is to alleviate the second problem I've described above.
I have three propositions for how to solve this:
- A fixed-lifetime cache periodically refreshed by a background job. Images to refresh would be put in the job queue as they approach the end of their lifetime, in batches to speed up the process. When an image has changed, a reparse/links-update job would be enqueued for all pages that use it. The job would be retried many times on failure, to cope with availability problems of the foreign repo. After some thinking I came to the conclusion that this is the most sensible way of doing it, but I include two other propositions as well.
- A simple optimistic cache mechanism. When rendering the page, the parser would just assume the version of image info in its cache is valid (well, unless its cache is empty for this particular image) and return that. After doing that, it would enqueue a task in the job queue to do strict pessimistic checks against the foreign repo for these images. If it fails, the task is put back in the queue. If it succeeds and the images have in fact changed in the meantime, the page is reparsed and the corrected version is presented to the users. This can be also combined with the existing fixed lifetime caching policy, which would override the optimistic cache when applicable.
- A slightly more complicated approach would be to use the same optimistic cache mechanism, but in conjunction with a background pingback service that would periodically (say, every minute) check whether the foreign repo is still available. If it is, the service would bypass the optimistic cache policy and let requests go through to the foreign repo. Otherwise, it would return values from the cache while warning the user that the image information might be out of date.
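The second proposition (optimistic reads, verification in the background) can be sketched in a few lines. The class and queue names below are mine, and the "job queue" is a plain list standing in for MediaWiki's real job queue; this only illustrates the control flow, not an implementation.

```python
# Hypothetical sketch of the optimistic cache (proposition two).
# Cache hits are served immediately; a verification job is queued
# instead of blocking the parse on the foreign repo.

class OptimisticImageInfoCache:
    def __init__(self, fetch, job_queue):
        self.fetch = fetch          # slow call to the foreign repo
        self.job_queue = job_queue  # stand-in for the MW job queue
        self.cache = {}

    def get(self, title):
        if title in self.cache:
            # Optimistically trust the cached value; verify later.
            self.job_queue.append(title)
            return self.cache[title]
        # Cache miss: we have no choice but to ask synchronously.
        self.cache[title] = self.fetch(title)
        return self.cache[title]

    def run_jobs(self, reparse):
        """Background verification; on failure a job would be re-queued."""
        while self.job_queue:
            title = self.job_queue.pop()
            fresh = self.fetch(title)
            if fresh != self.cache[title]:
                self.cache[title] = fresh
                reparse(title)  # re-render pages using the changed image
```

The key property is that a Commons outage only delays the background verification jobs; page parses keep being served from the cache instead of timing out.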
Fix Special:WantedFiles
This special page is completely non-functional on non-WMF sites that use foreign repos. It lists all foreign images, making it impossible to find the images that are really missing from the wiki. I would hope to fix that through… some means. I'm not sure yet, more investigation is needed, but an improved cache would possibly allow quickly checking whether an image really is missing or just lives in the foreign repository.
Audit other code utilizing ForeignAPIRepo
I'm not sure how much code using ForeignAPIRepo there is in the wild, but I could try to find problematic (serial) code and introduce request batching where possible. This is rather low-priority, though.
TBD, I have to consult a few things with people wiser than me first.
To schedule tasks, report progress and do most other dev things I will use Phabricator. The code will of course be in Wikimedia's Gerrit. As for communication I prefer IRC and email, but I can use any other means of communication.
I study at the Warsaw University of Technology. I have a BSc in Computer Science and am currently pursuing an MSc in Data Science (a CS equivalent). I think I stumbled upon the GSoC program accidentally while browsing mediawiki.org. During the summer I will not have any other significant commitments; I may be unavailable for a few days here and there, but I will definitely deliver everything on time :)
I'm interested in the idea of free culture and software, I currently lead a project dedicated to free humor (more in the next section). I think free culture and open-source software are vital for humanity and part of my motivation for this project is making free media from Commons more accessible.
For over a year I've been a system admin of a medium-sized wiki, Nonsensopedia, which is kind of like Uncyclopedia in Polish, but completely different in some regards; most notably, it puts a much larger focus on proper licensing and on making sure everything there really is free and funny. It also has much stricter standards regarding hate speech and controversial content.
As a sysadmin I have gathered a lot of experience with installing and maintaining MediaWiki. Over the last year I also wrote a few MW extensions for Nonsensopedia that could be useful for other wikis as well; you can find them listed on my user page. All the MW-related code I wrote is in our GitLab group, including some forks of extensions and other tools.
I also wrote some patches for MW and other extensions. This includes T246127, T231481, T240893, T205219, d7ff338a4cb3, T228584 and T228579. There are also a few patches in different states that weren't merged (yet).
I also attended the Wikimedia Hackathon in 2019, which really got me interested in MW development. I met a lot of interesting people there who have helped me do a lot of this work, thank you! :) Unfortunately I can't attend this year's Hackathon in Tirana.
As for other programming experience, I've written things in all kinds of languages (C, C++, C#, PHP, JS, Lua, Python, R, Matlab, Forth, 6502 assembly… yeah, I'll stop), but of course I am not proficient in all of them :) I have done a lot of university and hobby projects and have also written a few commercial applications (databases, production management, web dev, office automation). No programming can of worms scares me :)