
Proposal: InstantCommons improvements
Closed, Declined · Public

Description

This proposal didn't get anyone willing to mentor it, so I abandoned it.

Profile Information

Name: Can I not disclose this publicly? I will gladly give my name to WMF, I just don't want it to be too public.
IRC nickname on Freenode: Ostrzyciel
User page: https://www.mediawiki.org/wiki/User:Ostrzyciel
GitLab: https://gitlab.com/Ostrzyciel
Location: Poland
Typical working hours: 10AM to 10PM CET (UTC+1) / CEST (UTC+2)

Synopsis

Mentors

I have not yet looked for mentors for this project. I also need some help from people who know the API modules and Parsoid well. I am not sure whether what I wrote below is sensible, or how much of it I should take on :)

The problem(s)

The following comes from one year of being a sysadmin on a medium-sized wiki (https://nonsa.pl/) that uses a lot of images from Wikimedia Commons. I have shared some of my observations with other sysadmins who maintain wikis that also make heavy use of InstantCommons (mostly Uncyclopedia), and they have confirmed that these problems occur for them too.

Please note that these problems mostly don't concern Wikimedia, as it uses a completely different solution for accessing Commons. Unfortunately, the rest of the world is not Wikimedia and has to use InstantCommons.

InstantCommons is slooow

For a long time I tried to find out why some pages took 100 ms to parse, while others could take minutes before they were finally parsed. As it turned out, the decisive factor was the number of images from Commons present on the page. For every page parse, the parser would sequentially (!) request information about every image while blocking further parsing. Given that each request takes on the order of hundreds of milliseconds, parsing a page could get really slow, very quickly.
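To put that in perspective (a rough back-of-the-envelope estimate, not a measurement): at, say, 300 ms per request, a page with 50 Commons images spends about 50 × 0.3 s = 15 s just waiting for metadata, before any other parsing work even starts.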

This is a serious problem for the wiki I maintain, as we've decided to use as much freely licensed content with proper licensing information as possible, and a big part of that plan is using Commons a lot. We have real cases of pages that contain 50+ images from Commons (like this one: https://nonsa.pl/wiki/Kodeks_drogowy), and they regularly cause 504 gateway timeouts.

Commons brings down entire wikis with it

If Wikimedia Commons is unavailable for some reason, our wiki becomes completely unusable. Every time a page is parsed, it has to wait sequentially for each image's information request to time out, which causes the page parse to time out and halts the entire wiki. This… is more frequent than you'd think: it happens at least once a week, and I estimate Commons accounts for about 90% of our unplanned downtime.

This is bad. I'm not asking for Commons to become more reliable (hey, it's free, right?), but remote wikis need a better way of dealing with these outages, as well as with outages of any other foreign repos they might be using.

Possible solutions

Here I suggest a few solutions, i.e. tasks that I could do in the project to alleviate the above problems. I've listed them roughly from most to least sensible/possible/useful.

Batch imageinfo requesting in Parsoid

There isn't really a point in optimizing the current parser, as it's going away anyway, so I propose changing the image rendering code in Parsoid to defer all image information requests to the last, post-processing stage in the pipeline, somewhere around red-link checking. I think this has already been proposed (comments in the code suggest so), but never implemented. Some necessary code (like the ability to make batch requests) is already in place, so this probably wouldn't require deep changes to Parsoid.

As far as I understand Parsoid's nature, rendering images is context-free and does not depend on anything that is not contained within the [[File: (...)]] tag (after expansion, of course), so that should be possible and not break anything. Of course, everything would have to be tested thoroughly.
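To make this more concrete, here is a minimal sketch of what such a deferred, batched step could look like. The collectMediaPlaceholders() and applyMediaInfo() helpers, the data attributes, and the array-taking getFileInfo() call are all assumptions for illustration, not the actual Parsoid API:

```php
<?php
// Hypothetical post-processing pass: gather every media placeholder left in
// the DOM by earlier stages, then resolve all of their file infos at once.
// collectMediaPlaceholders(), applyMediaInfo() and the array-taking
// getFileInfo() call are placeholders, not the real Parsoid API.

function resolveMediaInfo( DOMDocument $doc, $dataAccess, $pageConfig ): void {
	$placeholders = collectMediaPlaceholders( $doc ); // e.g. nodes marked as media earlier

	// Build one batch covering every file (and requested dimensions) on the page.
	$batch = [];
	foreach ( $placeholders as $node ) {
		$batch[] = [
			'name' => $node->getAttribute( 'data-file-name' ),
			'dims' => [ 'width' => (int)$node->getAttribute( 'data-file-width' ) ],
		];
	}

	// Ideally a single round trip to the foreign repo instead of one per image.
	$infos = $dataAccess->getFileInfo( $pageConfig, $batch );

	foreach ( $placeholders as $i => $node ) {
		// Fill in the thumbnail URL, dimensions and missing/red-link state.
		applyMediaInfo( $node, $infos[$i] );
	}
}
```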

This would help drastically reduce parse times for pages with a lot of images in foreign repos.

This would also help reduce the load on Commons' public API a bit, in the long term, so Wikimedia might have a small private interest in this :)

Edit: I forgot about multiple-density thumbnails; their requests would still be serial and would slow the process down anyway. There is currently no way around this without rewriting an API module, which I describe below.

Rewrite foreignrepo API

See this task for more information: T89971

This is required to make the above fix really work (I think). Currently, thumbnails of different sizes have to be requested in separate requests due to constraints in the imageinfo API. The proposed set of features for such an API module is quite well described in the above and related tasks, so that shouldn't be too hard, just time-consuming.
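Purely for illustration, a single batched request to such a module might look something like the sketch below; the module name and fi* parameters are invented here and are not an existing or proposed API surface:

```php
<?php
// Invented example of one request covering several files and several
// thumbnail densities; "fileinfo" and the fi* parameters do not exist today.
$params = [
	'action'       => 'query',
	'prop'         => 'fileinfo',                     // hypothetical new module
	'titles'       => 'File:A.jpg|File:B.png|File:C.svg',
	'fithumbsizes' => '220|440|660',                  // 1x / 2x / 3x densities at once
	'fiprop'       => 'url|size|mime|sha1|timestamp',
	'format'       => 'json',
];
// Compare with the current imageinfo module, where iiurlwidth takes a single
// value, forcing one request per thumbnail size.
```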

Better ForeignAPIRepo client-side caching (somehow)

This is arguably rather vague, but "better" here means any form of sensible caching, as all caching is currently disabled by default in InstantCommons (see: T235551). Also, when one forces the caching on, it turns out to be rather… simplistic and unable to handle many cases well.

This is to alleviate the second problem I've described above.

I have three propositions for how to solve this:

  • A fixed-lifetime cache periodically refreshed by a background job (a rough sketch of such a job follows this list). Images to refresh would be put in the job queue when they approach the end of their lifetime, in batches to speed up the process. When an image has changed, a reparse/links-update job is put in the queue for all pages that use that image. The job would be retried multiple times if it fails, to cope with availability problems of the foreign repo. After some thinking I came to the conclusion that this is the most sensible way of doing it, but I include two other propositions as well.
  • A simple optimistic cache mechanism. When rendering the page, the parser would just assume the version of image info in its cache is valid (well, unless its cache is empty for this particular image) and return that. After doing that, it would enqueue a task in the job queue to do strict pessimistic checks against the foreign repo for these images. If it fails, the task is put back in the queue. If it succeeds and the images have in fact changed in the meantime, the page is reparsed and the corrected version is presented to the users. This can be also combined with the existing fixed lifetime caching policy, which would override the optimistic cache when applicable.
  • A slightly more complicated approach would be to use the same optimistic cache mechanism, but in conjunction with a background pingback service that would periodically (let's say, every minute) check whether the foreign repo is still available. If it is, it would skip the optimistic cache policy and let requests go through to the foreign repo. Otherwise, it would return values from the cache while warning the user that the image information might be out of date.
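As promised above, a very rough sketch of the first, job-based approach. The class name and the cachedInfoDiffers()/pagesUsingImage() helpers are invented for illustration; fetchImageQuery(), RefreshLinksJob and the job queue are existing MediaWiki pieces, but the wiring here is only a sketch, not a design:

```php
<?php
use MediaWiki\MediaWikiServices;

// Placeholder sketch of a refresh job for cached foreign image info.
// cachedInfoDiffers() and pagesUsingImage() are hypothetical helpers;
// error handling and batching details are omitted.
class RefreshForeignFileInfoJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'refreshForeignFileInfo', $title, $params );
	}

	public function run() {
		// Assumes the named repo is a ForeignAPIRepo (i.e. InstantCommons).
		$repo = MediaWikiServices::getInstance()->getRepoGroup()
			->getRepoByName( $this->params['repo'] );

		// Re-fetch info for a whole batch of soon-to-expire files in one API request.
		$fresh = $repo->fetchImageQuery( [
			'titles' => implode( '|', $this->params['files'] ),
			'prop'   => 'imageinfo',
			'iiprop' => 'timestamp|sha1|url|size|mime',
		] );

		foreach ( $this->params['files'] as $file ) {
			if ( $this->cachedInfoDiffers( $file, $fresh ) ) {
				// The image changed upstream: reparse every local page that uses it.
				foreach ( $this->pagesUsingImage( $file ) as $pageTitle ) {
					JobQueueGroup::singleton()->push( new RefreshLinksJob( $pageTitle, [] ) );
				}
			}
		}

		// Returning false would make the queue retry the job, which covers
		// temporary unavailability of the foreign repo.
		return true;
	}
}
```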

Fix Special:WantedFiles

This special page is completely non-functional on non-WMF sites that utilize foreign repos. It lists all foreign images, making it impossible to find images that are really missing from the wiki. I would hope to fix that through… some means. I'm not sure how yet (more investigation is needed), but an improved cache would possibly allow for quickly checking whether an image is really missing or just lives in the foreign repository.

Audit other code utilizing ForeignAPIRepo

I'm not sure how much code using ForeignAPIRepo there is in the wild, but I could try to find problematic (serial) code and introduce request batching where possible. This is rather low-priority, though.

Deliverables

TBD, I have to consult a few things with people wiser than me first.

Participation

To schedule tasks, report progress and do most other dev things I will use Phabricator. The code will of course be in Wikimedia's Gerrit. As for communication I prefer IRC and email, but I can use any other means of communication.

About Me

I study at the Warsaw University of Technology. I have a BSc in Computer Science and I am currently pursuing an MSc in Data Science (a CS equivalent). I think I stumbled upon the GSoC program accidentally while browsing mediawiki.org. During the summer I will not have any other significant commitments; I may be unavailable for a few days here and there, but I will definitely deliver everything on time :)

I'm interested in the idea of free culture and free software, and I currently lead a project dedicated to free humor (more in the next section). I think free culture and open-source software are vital for humanity, and part of my motivation for this project is making free media from Commons more accessible.

Past Experience

For over a year I've been a system admin of a medium-sized wiki, Nonsensopedia, which is kind of like a Polish Uncyclopedia, but completely different in some regards: most notably, it puts a much larger focus on proper licensing and on making sure everything there really is free and funny. It also has much stricter standards regarding hate speech and controversial content.

Being a sysadmin has given me a lot of experience with installing and maintaining MediaWiki. Over the last year I also wrote a few MW extensions for Nonsensopedia that could be useful for other wikis as well; you can find them listed on my user page. All MW-related code I wrote is in our GitLab group, including some forks of extensions and other tools.

I also wrote some patches for MW and other extensions. This includes T246127, T231481, T240893, T205219, d7ff338a4cb3, T228584 and T228579. There are also a few patches in different states that weren't merged (yet).

I also attended the Wikimedia Hackathon in 2019, which really got me interested in MW development. I met a lot of interesting people there who have helped me do a lot of this stuff, thank you! :) Unfortunately I can't attend this year's Hackathon in Tirana.

As for other programming experience, I have written things in all kinds of languages (C, C++, C#, PHP, JS, Lua, Python, R, Matlab, Forth, 6502 assembly, yeah, I'll stop), but of course I am not proficient in all of them :) I have done a lot of university and hobby projects and also wrote a few commercial applications (databases, production management, webdev, office automation). No programming can of worms scares me :)

Event Timeline

I am not sure if the above proposal is sensible or not in technical terms and some feedback from people that know these modules well would be much appreciated :) I am also looking for people willing to mentor this project.
I contacted @srishakatux before regarding this project and she recommended that I ask a few people about it, so I'm tagging them here.
@ssastry @cscott @Anomie @cicalese

Parsoid already does batch processing of images right at the end, in the AddMediaInfo pass. See https://github.com/wikimedia/parsoid/blob/b034fd98c60c2b8de8e6014c4f2ef816c48755fa/src/Wt2Html/PP/Processors/AddMediaInfo.php
@Arlolra, are there any caveats here?

Also, while this isn't immediately relevant for the GSoC project proposal, do note that it may be 15-18 months or longer before Parsoid replaces the MediaWiki core parser in the Wikimedia cluster. And, for third-party wikis, there is the added caveat that their own wikis and extensions need to be ready for Parsoid.

Anyway, it looks like the bulk of the proposed work is going to be in the ForeignRepo-related code in MediaWiki (changing the API so batching isn't defeated by dimension parameters, and caching).

T153080 moved the imageinfo requests from the token stream to a post-processing pass. In Parsoid/JS, the network requests got batched there. However, in Parsoid/PHP, it looks like the calls to getDataAccess()->getFileInfo happen in a loop. I'm not sure if passing an array of all the titles we need would result in batched database lookup, but it at least seems helpful for fetching from commons.

Parsoid, however, isn't in the critical path for rendering pages for view at present (I don't know what the full plans for the PHP port are, though). From a speed perspective, the old parser is much more relevant. The main issue is that it calls Parser::fetchFileAndTitle(), which calls $repoGroup->findFile one at a time, where each call makes a separate HTTP request (for metadata, as it needs to know the file dimensions, whether it exists, etc.; actual files are not fetched at this point in the default config). Since this is sending a request all the way to Commons, each request costs several RTTs of latency. MediaWiki is (AFAIK) not smart enough to use HTTP pipelining or anything, so I think every time an image is used you get a new TCP handshake, new TLS handshake and new HTTP connection. This adds up quickly.

So I think an obvious improvement here would be to have the parser not process makeImage() immediately, but add a marker, and then replace all the markers at once (similar to what's done for links). I think there are a lot of other things more generally that could be done to improve InstantCommons. It's a feature that is widely used, but doesn't get much love.
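For illustration only, a very rough sketch of that marker idea, loosely modelled on how link holders are batched; ImageHolderArray and renderImageHtml() are invented names, not existing parser code, and the repo layer would also need to actually batch the lookups for this to pay off with InstantCommons:

```php
<?php
// Placeholder sketch: instead of resolving each image while the parser walks
// the wikitext, record a marker and resolve all of them in one pass at the end.
// ImageHolderArray and renderImageHtml() are invented for illustration.
class ImageHolderArray {
	/** @var array[] marker id => [ Title $title, array $handlerParams ] */
	private $holders = [];

	public function makeHolder( Title $title, array $handlerParams ): string {
		$id = count( $this->holders );
		$this->holders[$id] = [ $title, $handlerParams ];
		// The parser drops this marker into the output instead of real HTML.
		return "<!--IMAGEHOLDER $id-->";
	}

	public function replace( string &$html, RepoGroup $repoGroup ): void {
		// Resolve every image on the page in one place, so the repo layer at
		// least has a chance to batch the lookups (as far as I can tell,
		// findFiles() currently still loops over findFile() internally).
		$titles = array_column( $this->holders, 0 );
		$files = $repoGroup->findFiles( $titles );

		$html = preg_replace_callback(
			'/<!--IMAGEHOLDER (\d+)-->/',
			function ( $m ) use ( $files ) {
				[ $title, $params ] = $this->holders[(int)$m[1]];
				$file = $files[$title->getDBkey()] ?? false;
				return renderImageHtml( $title, $file, $params ); // placeholder
			},
			$html
		);
	}
}
```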

The current plan is to pursue switching over everything to Parsoid on the Wikimedia cluster in the 15-18 month timeframe -- or at least pursue that timeline and see where we land.

Right, so that's why I propose only changes to Parsoid. We'll have to put up with the existing parser for now. Porting some of these improvements to the old parser is uh... maybe possible? I'd consider that second priority. Also improving perf on Parsoid may prove an additional incentive for external wikis to switch over to it :)

Ostrzyciel closed this task as Declined. Edited · Mar 28 2020, 9:23 AM

I am forced to abandon this proposal, as there was nobody willing to mentor this project.

Thank you for all your input and technical insights! :)

@Ostrzyciel: I'm sorry to see this, as this is a great tech-savvy proposal, but it feels like unfortunate timing (which is not your fault, of course) given the big picture.
I do hope that you will propose this again. :-/