
SyntaxHighlight extension is slow (needs more active stewardship)
Open, Needs TriagePublic

Description

Background

Follow-up from T148442. In a nutshell:

SyntaxHighlight is a simple extension that uses the Pygments open-source project. It works "okay" when used once or twice on a page, and it works "okay" when used many times on a popular and well-cached Wikipedia article.

Its performance is terrible when it is used extensively on technical documentation pages, because each syntax highlight block adds a high cost to the backend page parsing time. And because these pages are relatively unpopular (compared to Wikipedia), their readers are more often the first to visit such a page after the cache expires, and thus get a fresh parse upon view.

Worse, some pages have become inaccessible due to timeouts (page parsing taking more than the 60-second limit); power users may try to reload such a page many times in the hope that one random attempt will take less time and prime the cache for a few hours or days before it fails again.

Example: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples

Opportunity

There are definitely a lot of ways in which SyntaxHighlight could perform better. For example, the upstream "python-pygments" software could be optimised to spawn faster for small inputs, it could be improved to support batching, Parsoid could support batching, or it could be ported to (or swapped for) a plain PHP implementation. That kind of product development would require resourcing, and the extension currently does not have an active steward.
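To illustrate why batching would help, here is a rough Python sketch of highlighting many blocks in a single process, so the interpreter startup and pygments import cost is paid once rather than once per <syntaxhighlight> tag. The helper and its input format are hypothetical, not an existing pygmentize feature.

```
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

def highlight_batch(blocks):
    """blocks: list of (lang, code) tuples; returns HTML fragments.

    Hypothetical helper: interpreter startup and the pygments import are
    amortised over the whole batch instead of one process per block.
    """
    formatter = HtmlFormatter()
    return [highlight(code, get_lexer_by_name(lang), formatter)
            for lang, code in blocks]

print(highlight_batch([("php", "<?php echo 1;"), ("python", "print(1)")])[0])
```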

Event Timeline

Krinkle added a project: Developer-Advocacy.
Krinkle added subscribers: srishakatux, Bmueller.

I think this is unlikely to receive resourcing from the perspective of Wikipedia and its sister projects for open knowledge and open data, since the performance overhead is not noticeable when the extension is used only once or twice on a page, even for unpopular pages.

@Bmueller @srishakatux This might be a good opportunity for your team to work on, directly or indirectly. I think it would directly benefit the accessibility, appeal, and usability of documentation and technical outreach for Wikidata and the MediaWiki APIs in general, especially since that kind of content often uses (and thus suffers from) SyntaxHighlight.

If startup time is the main problem, one potential scaling hack would be to introduce a mode for the extension that calls out to a microservice instead of performing a shell exec. It should be trivial to write a tiny Flask app that exposes an HTTP API to convert a posted block of code into rendered HTML.
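A minimal sketch of such a service, assuming Flask and Pygments are installed; the /render route name and JSON payload shape here are assumptions, not an existing API:

```
from flask import Flask, request
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

app = Flask(__name__)

@app.route("/render", methods=["POST"])
def render():
    # Expects JSON like {"code": "...", "lang": "python"} (assumed shape)
    payload = request.get_json(force=True)
    lexer = get_lexer_by_name(payload.get("lang", "text"))
    html = highlight(payload["code"], lexer, HtmlFormatter())
    return html, 200, {"Content-Type": "text/html; charset=utf-8"}
```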

From #mediawiki-core yesterday

14:57:05 <legoktm> I don't know why ori decided to create a zip bundle, I do remember that he wrote a small wgsi service for pygments that he decided wasn't necessary because shelling out had good enough performance
17:33:26 <AaronSchulz> legoktm: is that code anywhere?
17:34:30 <legoktm> AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/services/pyglet/+/220686 it's really trivial

Resurrecting/building said service should be trivial today and could be deployed on k8s.

> From #mediawiki-core yesterday
>
> 14:57:05 <legoktm> I don't know why ori decided to create a zip bundle, I do remember that he wrote a small wgsi service for pygments that he decided wasn't necessary because shelling out had good enough performance
> 17:33:26 <AaronSchulz> legoktm: is that code anywhere?
> 17:34:30 <legoktm> AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/services/pyglet/+/220686 it's really trivial
>
> Resurrecting/building said service should be trivial today and could be deployed on k8s.

An easier approach that might be worth considering first is to install python-pygments the "normal" way, as a package on app servers and/or the Shellbox service, i.e. not as a zipped binary but as a regular Python package, and point $wgPygmentizePath to it in production. It doesn't really matter if it's the exact same version or not, so long as it's compatible.

Aaron mentioned in our team meeting that this would likely save a fair amount of startup overhead by not unzipping the binary on every shell call.

> An easier approach that might be worth considering first is to install python-pygments the "normal" way, as a package on app servers and/or the Shellbox service, i.e. not as a zipped binary but as a regular Python package, and point $wgPygmentizePath to it in production. It doesn't really matter if it's the exact same version or not, so long as it's compatible.

We'd have to figure out how to keep the generated CSS and language list in sync with a potentially different version of pygments. Maybe we shell out to pygments instead of keeping it in the repo and then heavily cache the results of that?
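One way that could look, as a hedged sketch: shell out to whatever pygmentize is installed and cache the output, so the generated CSS and lexer list always match the installed version (this assumes pygmentize is on PATH; the in-process cache is only illustrative).

```
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def pygments_css(style="default"):
    # "pygmentize -f html -S <style>" prints the stylesheet for a style
    return subprocess.run(
        ["pygmentize", "-f", "html", "-S", style],
        capture_output=True, text=True, check=True,
    ).stdout

@functools.lru_cache(maxsize=None)
def pygments_lexers():
    # "pygmentize -L lexers" lists every lexer known to this version
    return subprocess.run(
        ["pygmentize", "-L", "lexers"],
        capture_output=True, text=True, check=True,
    ).stdout
```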

> @Bmueller @srishakatux This might be a good opportunity for your team to work on, directly or indirectly. I think it would directly benefit the accessibility, appeal, and usability of documentation and technical outreach for Wikidata and the MediaWiki APIs in general, especially since that kind of content often uses (and thus suffers from) SyntaxHighlight.

The Developer-Advocacy team discussed this suggestion in our 2021-01-14 team meeting. We do agree that SyntaxHighlight has a lot of value for our technical documentation. Unfortunately we cannot currently take responsibility as a team for maintaining a MediaWiki extension. At the current time we also do not believe we can appropriately support the search for a new "owner" in the form of a Foundation or affiliate dev team.

Now that I have posted T271751#6752252 wearing my Foundation staff hat, I can also say that I would be happy to help any effort to improve the performance of SyntaxHighlight in my volunteer hours.

> From #mediawiki-core yesterday
>
> 14:57:05 <legoktm> I don't know why ori decided to create a zip bundle, I do remember that he wrote a small wgsi service for pygments that he decided wasn't necessary because shelling out had good enough performance
> 17:33:26 <AaronSchulz> legoktm: is that code anywhere?
> 17:34:30 <legoktm> AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/services/pyglet/+/220686 it's really trivial
>
> Resurrecting/building said service should be trivial today and could be deployed on k8s.

This seems like a reasonable idea to me. I did not remember that Ori had written a thing like this, but after my comment at T271751#6741098 I built a quick proof of concept that actually is fairly similar code at https://pygments.toolforge.org/. The source for my poc tool is found in R2913 tool-pygments.

> Now that I have posted T271751#6752252 wearing my Foundation staff hat, I can also say that I would be happy to help any effort to improve the performance of SyntaxHighlight in my volunteer hours.
>
> From #mediawiki-core yesterday
>
> 14:57:05 <legoktm> I don't know why ori decided to create a zip bundle, I do remember that he wrote a small wgsi service for pygments that he decided wasn't necessary because shelling out had good enough performance
> 17:33:26 <AaronSchulz> legoktm: is that code anywhere?
> 17:34:30 <legoktm> AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/services/pyglet/+/220686 it's really trivial
>
> Resurrecting/building said service should be trivial today and could be deployed on k8s.
>
> This seems like a reasonable idea to me. I did not remember that Ori had written a thing like this, but after my comment at T271751#6741098 I built a quick proof of concept that actually is fairly similar code at https://pygments.toolforge.org/. The source for my poc tool is found in R2913 tool-pygments.

Ooh, nice! So I think we would want the following endpoints:

  • POST /render/ (implemented)
  • GET /css - returns generated CSS
  • GET /lexers - returns lexer list
  • GET /healthz - for monitoring

Do we need the forms and CSRF protection if this isn't intended to have a user-facing interface? (or do you want that functionality?)

I also think the mapping of deprecated/removed languages (mostly from GeSHi) to current ones should be in the service so we can add new ones as we upgrade pygments. It might be worth looking into whether we can just upstream the extra aliases we have to pygments itself. Maybe @Proc would be interested in that since they recently did some upstream stuff?

In SyntaxHighlight, we would need to adapt the RL module, lexer list, and the actual shell-out to make (cached) HTTP requests to a service instead of using the hardcoded/bundled stuff (which should be a fallback for non-service users).

I also wonder if we should have a filter in SyntaxHighlight to sanitize the HTML response if there was an XSS in pygments or in case of MITM.

Ultimately I think this is a better track to go down rather than trying to adapt SyntaxHighlight+pygments for Shellbox.
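A rough sketch of what the proposed /css, /lexers, and /healthz endpoints could look like on top of the pygments Python API (the ".mw-highlight" selector and the JSON shape are assumptions):

```
from flask import Flask, jsonify
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_all_lexers

app = Flask(__name__)

@app.route("/css")
def css():
    # Stylesheet for the default style, scoped to an assumed wrapper class
    body = HtmlFormatter().get_style_defs(".mw-highlight")
    return body, 200, {"Content-Type": "text/css; charset=utf-8"}

@app.route("/lexers")
def lexers():
    # Name and aliases of every lexer known to this pygments version
    return jsonify([
        {"name": name, "aliases": list(aliases)}
        for name, aliases, _filenames, _mimetypes in get_all_lexers()
    ])

@app.route("/healthz")
def healthz():
    # Liveness probe for monitoring
    return "OK"
```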

> This seems like a reasonable idea to me. I did not remember that Ori had written a thing like this, but after my comment at T271751#6741098 I built a quick proof of concept that actually is fairly similar code at https://pygments.toolforge.org/. The source for my poc tool is found in R2913 tool-pygments.
>
> Ooh, nice! So I think we would want the following endpoints:
>
>   • POST /render/ (implemented)
>   • GET /css - returns generated CSS
>   • GET /lexers - returns lexer list
>   • GET /healthz - for monitoring

That seems like a reasonable list of actions that would be needed for hilightoid®. I think I would add a GET /schema endpoint for an OpenAPI spec describing the other endpoints.

> Do we need the forms and CSRF protection if this isn't intended to have a user-facing interface? (or do you want that functionality?)

I don't think it would be needed for a deployed service. I mostly stuck it in the POC to make playing with it easier.

> I also think the mapping of deprecated/removed languages (mostly from GeSHi) to current ones should be in the service so we can add new ones as we upgrade pygments.

+1

> It might be worth looking into whether we can just upstream the extra aliases we have to pygments itself. Maybe @Proc would be interested in that since they recently did some upstream stuff?

Upstream is always better when possible. :)

> In SyntaxHighlight, we would need to adapt the RL module, lexer list, and the actual shell-out to make (cached) HTTP requests to a service instead of using the hardcoded/bundled stuff (which should be a fallback for non-service users).
>
> I also wonder if we should have a filter in SyntaxHighlight to sanitize the HTML response if there was an XSS in pygments or in case of MITM.
>
> Ultimately I think this is a better track to go down rather than trying to adapt SyntaxHighlight+pygments for Shellbox.

Taking a slow thing and sticking it behind another layer of indirection seems like it would not meet the goal of this task. So if it was for some reason better to use shellbox there should also be other performance changes made.

> Ultimately I think this is a better track to go down rather than trying to adapt SyntaxHighlight+pygments for Shellbox.
>
> Taking a slow thing and sticking it behind another layer of indirection seems like it would not meet the goal of this task. So if it was for some reason better to use shellbox there should also be other performance changes made.

Agreed, but it's also important to use data to inform these claims as we go along. While I understand that in principle Python functions should start faster from an already-running process than from a new one, it's not obvious to me that this will be significantly faster, or relevant in proportion to the goal at hand. If it means pages will still time out when they have roughly the same number of syntaxhighlight elements, then it might not be worth investing in all this.

The maintenance cost of installing a Debian package and calling it is certainly much lower than developing, maintaining, documenting, testing, and deploying a new homemade microservice. Exciting as that sounds, I fear it might be overkill, and could come at the expense of more worthwhile things we could be doing elsewhere.

Sorry if the Python angle of this is obvious, as I don't have a lot of experience scaling Python services. To me, it isn't obvious that it couldn't actually be a net loss in performance. Once you have a long-running process, one also needs to consider concurrency, request queues, perhaps load-balancer or master-thread indirection, pod creation (if auto-scaling on k8s), etc. That's not trivial to get right (as we've seen with Thumbor). The same will be true for Shellbox in principle, and either way it means the "start command vs start function" portion of the latency will become a smaller percentage after all this, and the potential gain also shrinks along with it.

Rough expectation:

  1. (Slow down) We can start treating pygmentize as an external command instead of a bundled one, which means we need to cache the CSS, figure out how to vary the parser cache and CSS module version on the version of pygmentize, and document which versions we support, possibly with an enforced/fail-fast minimum version. This can be pretty loose otherwise, as our only API interface would be stdin and a few CLI options, which I expect would remain as stable if not more stable than the runtime API. These are significant changes we'd make regardless of how we call the Python code, so this is something we can already start doing today.
  2. (Speed up) No longer decompressing pygmentize at runtime. Right after point 1 we can stop paying this penalty by installing Debian's python3-pygments package on the app server as we do for other shell commands already. That should give us a notable speed up, according to @aaron.
  3. (Slow down) Once Shellbox is up and running, it'll automatically be used for this by MW (the transition will likely not be optional for very long). That will naturally incur some cost due to the addition of network latency and the runtime indirection of k8s, Docker, load balancing, pod scaling, and the HTTP service itself. These are similar if not identical costs to what a novel pygmentize microservice would incur, so basically if we just wait and see, we'll find out more or less for free how much that adds. We may want to add a timing metric in Graphite from MW's side for this if we don't have it already.
  4. (Potential speed up) Perhaps we can bench somewhere, quick-n-dirty, how much difference the process startup makes, and then figure out who would build/own/maintain the service long-term, with initial or ongoing help from volunteers (incl. staff volunteer-time :D).

That sounds like a good plan overall.

>   1. (Slow down) We can start treating pygmentize as an external command instead of a bundled one, which means we need to cache the CSS, figure out how to vary the parser cache and CSS module version on the version of pygmentize, and document which versions we support, possibly with an enforced/fail-fast minimum version. This can be pretty loose otherwise, as our only API interface would be stdin and a few CLI options, which I expect would remain as stable if not more stable than the runtime API. These are significant changes we'd make regardless of how we call the Python code, so this is something we can already start doing today.

I would suggest that if we detect that the bundled version is running, we use the generated lexer list and CSS so there's no perf hit at this step. Scribunto somewhat already does this.

>   2. (Speed up) No longer decompressing pygmentize at runtime. Right after point 1 we can stop paying this penalty by installing Debian's python3-pygments package on the app server as we do for other shell commands already. That should give us a notable speed up, according to @aaron.

Using a Debian package will come with other costs too. If we want to stay on pygments 2.8.0, we'll need to maintain our own package (stretch has 2.2.0, buster 2.3.1, bullseye 2.7.1). Since it's pure Python it should be straightforward, but it's just another thing to do. Upgrades will require SRE intervention and will roll out like any other OS package, which means they can't go out progressively like we currently do with the train. Just something to keep in mind.

>   3. (Slow down) Once Shellbox is up and running, it'll automatically be used for this by MW (the transition will likely not be optional for very long). That will naturally incur some cost due to the addition of network latency and the runtime indirection of k8s, Docker, load balancing, pod scaling, and the HTTP service itself. These are similar if not identical costs to what a novel pygmentize microservice would incur, so basically if we just wait and see, we'll find out more or less for free how much that adds. We may want to add a timing metric in Graphite from MW's side for this if we don't have it already.
>   4. (Potential speed up) Perhaps we can bench somewhere, quick-n-dirty, how much difference the process startup makes, and then figure out who would build/own/maintain the service long-term, with initial or ongoing help from volunteers (incl. staff volunteer-time :D).

For reference: https://pythondev.readthedocs.io/startup_time.html
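For a quick-and-dirty measurement of what that startup overhead looks like on a given host, something along these lines would do (plain stdlib; the iteration count is arbitrary):

```
import statistics
import subprocess
import time

def time_cmd(argv, runs=30):
    """Mean and stdev wall-clock time (seconds) of spawning argv."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

print(time_cmd(["python3", "-c", "pass"]))              # bare interpreter startup
print(time_cmd(["python3", "-c", "import pygments"]))   # plus the pygments import
```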

>   2. (Speed up) No longer decompressing pygmentize at runtime. Right after point 1 we can stop paying this penalty by installing Debian's python3-pygments package on the app server as we do for other shell commands already. That should give us a notable speed up, according to @aaron.

> Using a Debian package will come with other costs too. If we want to stay on pygments 2.8.0, we'll need to maintain our own package (stretch has 2.2.0, buster 2.3.1, bullseye 2.7.1). Since it's pure Python it should be straightforward, but it's just another thing to do. Upgrades will require SRE intervention and will roll out like any other OS package, which means they can't go out progressively like we currently do with the train. Just something to keep in mind.

T276298: Package latest python3-pygments for apt.wikimedia.org. I'll leave it for someone else to take on #1.

> T276298: Package latest python3-pygments for apt.wikimedia.org. I'll leave it for someone else to take on #1.

Thanks. I'll give it a shot.

On my VM, the difference between using the zipped pygmentize that is bundled with the extension vs. the system one is negligible:

ori@mediawiki:~$ hyperfine "/srv/mediawiki/extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php" "/usr/bin/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php"

Benchmark #1: /srv/mediawiki/extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php
  Time (mean ± σ):     205.1 ms ±  16.8 ms    [User: 183.2 ms, System: 20.2 ms]
  Range (min … max):   186.4 ms … 255.9 ms    14 runs

Benchmark #2: /usr/bin/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php
  Time (mean ± σ):     193.5 ms ±  10.6 ms    [User: 169.3 ms, System: 22.6 ms]
  Range (min … max):   178.1 ms … 214.4 ms    15 runs

Summary
  '/usr/bin/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php' ran
    1.06 ± 0.10 times faster than '/srv/mediawiki/extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -l php -f html -o /dev/null /srv/mediawiki/rest.php'

Do we have any evidence that SyntaxHighlight performance is a problem for users, besides https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples ?

Unscientific samples:

  1. Open https://www.mediawiki.org/w/index.php?title=Extension:WikiLambda&action=edit
  2. Replace {{ExtensionInstall|registration=required|db-update=1}} with:
<syntaxhighlight lang="php">
wfLoadExtension( 'WikiLambdaABC' );
</syntaxhighlight>
  3. "Show preview" and note the parse time.
  4. Repeat three times with different letters (to force a cache miss; to repro, use a different sequence..)
  5. Repeat three more times in the same way, but with <pre></pre> instead of <syntaxhighlight lang="php"></syntaxhighlight>.

  • with syntaxhighlight: 934ms (mw1351), 884ms (m1324), 737ms (mw1331), 1115ms (mw1327).
  • with pre: 401ms (mw1397), 472ms (mw1322), 502ms (mw1319).

[Attached screenshot: cap.png (1×876 px, 106 KB), via script.]

The difference seems significant, and much bigger than makes sense for a single one-line invocation. I am probably hitting something silly here or having really bad luck.

I tried to reproduce and got different results:

  • With SyntaxHighlight: 1393ms, 1608ms, 1599ms (average: 1533ms)
  • With pre: 1010ms, 1108ms, 1343ms (average: 1154ms)

I did not make note of the app servers generating each response.

That's still a difference of ~400ms, which is significant, and significantly more than the 200ms of wall-clock time for a direct CLI invocation of Pygmentize, which is what I measured on my VPS (which is much less powerful than an app server). That does seem odd.

@bd808 a microservice like the one you whipped up sounds like the right idea going forward. Are you still up for it? I'd be happy to help somehow.

> @bd808 a microservice like the one you whipped up sounds like the right idea going forward. Are you still up for it? I'd be happy to help somehow.

I can certainly build out a slightly more mature strawdog implementation that is set up to deploy as a container, as a weekend project. From there, the big lift would be all the dotting of i's and crossing of t's to get it actually deployed and the MediaWiki-side glue written. Right now I don't have the emotional energy to do those bits, so if others wanted to pitch in on that direction it would be helpful. I suppose we will need to satisfy someone about the long-term maintenance of a new service too. I don't anticipate it being any real long-term burden at the moment, but we've got a lot of shared trauma about "permanent ownership" of things to work through.

This should be re-evaluated now that the conversion to Shellbox is done.

Addressing "SyntaxHighlight extension is slow", we can:

  • add some more Shellbox replicas and/or php-fpm workers
  • look into improving pygments' CSS/JS/Lua lexers upstream since those are what's most commonly used from skimming the logs
  • implement a proper pygments-server that handles requests directly. The main thing we cut out is the act of shelling out, and probably some indirection with Apache/PHP-FPM. We'd want to make sure we retain per-request CPU/memory/time limits given pygments' history of ReDoS attacks

For large pages like https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples which have a ton of <syntaxhighlight> tags, another strategy could be to globally keep track of how much time has been spent so far, and once we're at say, 75% of the request time limit, bail and just format the remaining tags as plain text.
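A rough illustration of that time-budget idea (in Python for brevity, although the actual extension is PHP; the limit and fraction are assumptions):

```
import html
import time

REQUEST_BUDGET_S = 60.0     # assumed request time limit
HIGHLIGHT_FRACTION = 0.75   # stop highlighting once 75% of the budget is spent

def render_tags(tags, highlight_one, request_start):
    """tags: list of (lang, code); highlight_one: callable doing the real highlighting."""
    out = []
    for lang, code in tags:
        spent = time.monotonic() - request_start
        if spent > REQUEST_BUDGET_S * HIGHLIGHT_FRACTION:
            # Budget exhausted: fall back to plain, escaped text
            out.append("<pre>" + html.escape(code) + "</pre>")
        else:
            out.append(highlight_one(lang, code))
    return out
```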

We have some timing metrics on the Shellbox side at https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main and from the appserver side at https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=shellbox-syntaxhighlight&from=now-1h&to=now

Perhaps some sort of batching would be possible as well.

The idea of batching goes kind of against the grain (both for parser extensions and for Shellbox). But... Shellbox does use HTTP underneath, so it wouldn't be too far-fetched to utilize MultiHttpClient in some way. And, as far as parser extensions go, this one doesn't do any recursion, so some kind of early preload batch that populates a keyed array to re-use during the actual parser hook callback might work (or build up a batch and re-insert in a late step, but that seems more fragile as it would then rely on the batch for correctness rather than only as a preload optimisation).

Regarding batching, Math does something similar today and T268785: IDEA: Move parallel tag parsing logic from Math to core exists to move it to core.