
Translate extension tags increase page load time from 7 to more than 25 seconds
Closed, Declined | Public

Description

I wanted to prepare the (long) page https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples for translation. I inserted quite a large number of "translate" tags, as I only wanted to translate some parts of the example queries (mainly the comments) and not the rest, to help people understand how to use SPARQL for queries.

One problem among others, which does not make the work easier, is that the page then takes almost half a minute to load, which also makes it slow to find mistakes in the translate tags. See revision https://www.wikidata.org/w/index.php?title=Wikidata:SPARQL_query_service/queries/examples&oldid=390053650
Also see the discussion at https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service/queries/examples#Translation

(On a side note, one of the other problems is that the translate tags and metadata are seen as a burden by people who would rather cancel translation than deal with the tags. I'll file another bug report.)

Event Timeline

https://performance.wikimedia.org/xhgui seems broken, so I currently cannot see what is happening with the parsing of that page.

I was able to run xhgui now. I loaded https://www.wikidata.org/w/index.php?title=Wikidata:SPARQL_query_service/queries/examples&oldid=390053650 with profiling enabled. It timed out (no surprise there), but produced some information: https://performance.wikimedia.org/xhgui/run/view?id=5fe19b11f8d90fbf5d517e91

Almost all of the time is spent in MediaWiki\Shell\Command::execute. The only shell caller in Translate is for YAML parsing, which I am fairly sure is not happening here. If you sort by inclusive wall time, you can see that the culprit seems to be SyntaxHighlight::highlight. It is possible that Translate makes loading this page slower, not by doing something slow itself, but by interfering with parsing in such a way that some cache is skipped, or something of that sort. If evidence of this is found, please re-tag Translate, but for now I am handing this over to the performance team to investigate further.

The page has over 100 syntax highlight sections. As I understand it, those have always been slow to parse; this is not a regression and is not related to the Translate extension.

In terms of general mitigation our team can help with:

  • I note that the oldid= link is consistently slow to respond, which suggests that ParserCache and/or CDN cache is disabled for these. This is a known issue and more or less by design (ref T244058), because we literally have billions of revisions for all pages, and we don't plan to buy capacity to store all of these for long periods of time. The work of T244058 does cover allowing some of them to be stored for a short period of time (up to an hour), however that is not motivated by end-user performance, but by infrastructure resources. If a permalink becomes a trending URL, we need to make sure it does not draw significant resources from the servers on every access attempt.
  • For the regular page URL (without oldid) I found that ParserCache works as expected. The Translate extension does not appear to be causing an issue there.
  • Parsing of this examples page took about 20 seconds for me on the first try. But after purging it a few times, I did find that on some attempts I got a WMFTimeoutException (meaning it took more than 60 seconds). This is a problem, because it means the page cannot be cached in that case and thus the same expensive resources will be spent a second time. In terms of security this is not an issue because PoolCounter protects against this, but it is of course still problematic to allow pages to be made that can't render consistently. This is meant to be prevented by deterministic counting of "expensive" parser functions, which we have a limit on. If SyntaxHighlight is not counted as "expensive", this should be fixed. If it is, then perhaps we need to increase its weight and/or lower the threshold to disallow pages with more than a certain number of syntax highlight sections.
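
To illustrate the idea of deterministic, weighted counting mentioned in the last point, here is a minimal sketch in Python. It is not MediaWiki's implementation: the class name, the limit of 500 and the weight of 5 per syntax highlight section are all hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass, field


class ExpensiveLimitExceeded(Exception):
    """Raised when a page's cumulative cost goes over the configured limit."""


@dataclass
class ExpensiveCounter:
    # Hypothetical limit and weights; real values would be a site policy decision.
    limit: int = 500
    weights: dict = field(default_factory=lambda: {"syntaxhighlight": 5})
    total: int = 0

    def charge(self, feature: str) -> None:
        # Features without an explicit weight cost 1 unit each.
        self.total += self.weights.get(feature, 1)
        if self.total > self.limit:
            raise ExpensiveLimitExceeded(
                f"cost {self.total} exceeds limit {self.limit}"
            )


def check_page(feature_calls, counter=None):
    """Charge every call in order and return the total cost of the page."""
    if counter is None:
        counter = ExpensiveCounter()
    for feature in feature_calls:
        counter.charge(feature)
    return counter.total
```

Because the cost depends only on the page content and not on how long a render happens to take, the same page is always accepted or always rejected. With these assumed numbers, a page with 100 syntax highlight sections is still allowed and the 101st section is refused.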

As for the performance of SyntaxHighlight itself, I note that it already has its own local caching (source), which allows individual sections to be cached and re-used across different edits. This means that, in theory, if you make a small edit to the page, it will not have to re-parse all sections.
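
A minimal sketch of why per-section caching helps, assuming a cache keyed on a hash of the language and the code. The actual cache key and storage used by the extension are not shown in this task, so `SectionCache`, `slow_highlight` and `render_page` are hypothetical names for illustration.

```python
import hashlib


class SectionCache:
    """Caches highlighted output per section, keyed by content hash."""

    def __init__(self, highlight):
        self._highlight = highlight  # the expensive function being wrapped
        self._store = {}
        self.misses = 0

    def get(self, code, lang):
        # Assumed key: hash of language plus code, so any change to a
        # section produces a new key while unchanged sections keep theirs.
        key = hashlib.sha256(f"{lang}\x00{code}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._highlight(code, lang)
        return self._store[key]


def slow_highlight(code, lang):
    # Stand-in for the real highlighter, which shells out to an external tool.
    return f'<pre class="{lang}">{code}</pre>'


def render_page(sections, cache):
    """Render every (code, lang) section of a page through the cache."""
    return [cache.get(code, lang) for code, lang in sections]
```

In this sketch, rendering a page with two sections costs two highlighter calls the first time. After editing one section, a re-render costs only one more call, because the unchanged section is served from the cache.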

There are definitely a lot of ways in which SyntaxHighlight could perform better. For example, the upstream "python-pygments" software could be optimised to spawn faster for small inputs, or it could be improved to support batching, or Parsoid could support batching, or it could be ported to or swapped for a plain PHP implementation. That kind of product development would require resourcing, and the extension currently does not have an active steward. This is unlikely to change from the perspective of Wikipedia or MediaWiki at large, since articles rarely use this and we have pretty good caching for those.

The main area where this issue is visible is the documentation pages for use within our technical communities, rather than the main knowledge content we serve. In the past two years we have significantly increased investments in technical outreach for Wikidata and MediaWiki more generally, with API documentation etc. Perhaps it could be funded from that perspective.

I've filed T271751 for the follow-up idea to improve SyntaxHighlight performance more generally. I'm declining this task as I'm not aware of any direct issue here in relation to Translate or MediaWiki in general.

We allow users to create pages that are slow. This is a trade-off in what we invest in: the benefit is that a lot of things are possible (albeit in degraded form), which we think is better than the alternative of disallowing these through strict performance requirements, in which case we would, for example, not allow pages to have more than a handful of code examples. I think we can do better in showing editors why a page is slow and what they can do about it. For example, editors may want to split up a page into more accessible subpages. It's a trade-off, and the choice is theirs to make.