"Math extension cannot connect to Restbase." error in Wikimedia projects
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue:
Search for "Math extension cannot connect to Restbase." in Wikipedia

What happens?:
One current example is: https://fr.wikipedia.org/wiki/Action_par_conjugaison
It contains this message in red:

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "http://localhost:6011/fr.wikipedia.org/v1/":): {\displaystyle \operatorname{aut}_g=\operatorname{aut}_{g'}} (translated from the French interface text)

Note: Wikipedia:Purge is a pragmatic way of fixing this for a single page (which is why many search results no longer show the problem), but the problem reappears on new pages (it started over 24 h ago).

(include links if applicable):
I opened a topic in en.Wikipedia for this here:

Event Timeline

Convenience link for enwiki search (47 results): https://en.wikipedia.org/w/index.php?go=Go&search=%22Math+extension+cannot+connect+to+Restbase.%22&title=Special:Search&ns0=1

A lot of those are not current. Clicking on the result loads a page without the error message currently on it. Probably got cached when the error was occurring.
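
To tell stale search hits from pages that still show the error, one could check the current rendering of each hit. The following is only a sketch: the function names are made up for illustration, and it assumes the error text appears literally in the HTML returned with action=render.

```javascript
// The literal message from the task title.
const RESTBASE_ERROR = 'Math extension cannot connect to Restbase.';

// Returns true if the given rendered HTML still contains the error text.
function hasRestbaseError(html) {
	return html.includes(RESTBASE_ERROR);
}

// Hypothetical helper: fetch the current rendering of a page and test it.
// fetchFn is injectable (defaults to the global fetch) so the check can be tried offline.
async function isHitCurrent(pageUrl, fetchFn = fetch) {
	const url = new URL(pageUrl);
	url.searchParams.set('action', 'render');
	const response = await fetchFn(url.toString());
	return hasRestbaseError(await response.text());
}
```

A hit for which isHitCurrent() resolves to false is most likely just a leftover in the search index.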

The problem is still there, here are some numbers for different projects obtained with the search function: en (61), es (54) and fr (50).

As a layman I have questions with respect to the following change:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/945814

In MathRenderer.php a change was made to public function readFromDatabase():

  • Does merge on Aug 05 imply that this change is "live in Wikipedia" since Aug 05?
  • The commit message says: Avoid DB acces in non-Database tests; Is it obvious that the change will only have an effect on tests, but not on Wikipedia articles?

No, that's a change which only landed in the development branch on Sunday and won't be in production until next week (there's no deployment train this week). It's not relevant to this task.

This sounds very much like a production availability glitch, but I thought we'd dropped all the RESTBase-facing code from production?

In MathRenderer.php a change was made to public function readFromDatabase():

  • Does merge on Aug 05 imply that this change is "live in Wikipedia" since Aug 05?

No, see the roadmap. My change is not live anywhere at the moment.

  • The commit message says: Avoid DB acces in non-Database tests; Is it obvious that the change will only have an effect on tests, but not on Wikipedia articles?

The change is a no-op for non-test code, and therefore for production as well. Even if it weren't, it only changes the method used to connect to the wiki database, not to RESTBase.

edit: Sorry, I didn't see the previous comments. I'll leave this here since I added some details.

As a layman I have questions with respect to the following change:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/945814

In MathRenderer.php a change was made to public function readFromDatabase():

  • Does merge on Aug 05 imply that this change is "live in Wikipedia" since Aug 05?

No. A new version of the software is deployed once a week. It includes all of the code changes merged that week and is released to smaller projects first and to Wikipedias later (this is called the "deployment train"). There's a bit of documentation about this here: https://wikitech.wikimedia.org/wiki/Deployments/Train. This change should go live on Wikipedias next Thursday, apparently (not this Thursday, 10 August).

You can find out whether a change is deployed on a wiki in two steps:

  1. Check which branch is currently deployed on the wiki using Special:Version (in this case, 1.41.0-wmf.20):
  2. Compare to the list of branches shown when you click "Included in" on Gerrit, in the three dot menu in top-right:

In this case, the change is not included in 1.41.0-wmf.20, so it's not live on Wikipedia. Later today, a new branch 1.41.0-wmf.21 will appear in the list on Gerrit, and later this week it will appear on Special:Version on different wikis as it gets deployed there.
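
The two-step check above can be written down as a small comparison. This is a sketch, not an official tool: the function name is invented, the branch list has to be copied by hand from Gerrit's "Included in" dialog, and the "wmf/" prefix on branch names is an assumption about how deployment branches are named.

```javascript
// version: the version string shown on Special:Version, e.g. "1.41.0-wmf.20".
// includedIn: branch names copied from Gerrit's "Included in" dialog.
// Assumption: deployment branches may carry a "wmf/" prefix, which is stripped before comparing.
function isChangeDeployed(version, includedIn) {
	const wanted = version.trim();
	return includedIn.some(function (branch) {
		return branch.trim().replace(/^wmf\//, '') === wanted;
	});
}

// Hypothetical usage, with a change that has only reached the development branch:
// isChangeDeployed('1.41.0-wmf.20', ['master']) returns false.
```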

  • The commit message says: Avoid DB acces in non-Database tests; Is it obvious that the change will only have an effect on tests, but not on Wikipedia articles?

The change to src/MathRenderer.php could theoretically affect Wikipedia articles, and not just tests, but in this case it's a trivial maintenance change, and I'm sure it won't affect this bug in any way.

Thanks for the quick answer. So the version that "went live" on all Wikipedias in the last few days is 1.41.0-wmf.20, and this version contains no changes to the Math extension.

This sounds very much like a production availability glitch, but I thought we'd dropped all the RESTBase-facing code from production?

No, RESTBase is still used. Progress here is very slow. We finally got rid of Math PNG images in RESTBase, though. Shortly after that, we observed a performance regression, cf. T341666.

edit: Sorry, I didn't see the previous comments. I'll leave this here since I added some details.

Thank you for the added details, certainly much better than my hasty reply :) Just one correction to what you said: this week there'll be no train (T340249).

Errors due to RESTbase have suddenly shot up in the past couple of weeks. I monitor https://en.wikipedia.org/wiki/Category:Articles_with_math_errors and there are normally a couple a day, and it was easy to keep the category mainly empty.

But now it has 57 entries, all new since Sunday, and 4 new entries in the last hour alone. https://phabricator.wikimedia.org/F37486248

At this rate, it becomes increasingly hard to manually keep on top of the errors.

I think I found out that this problem started before August 2023: The html source of articles contains a timestamp from the parser. For the article https://pt.wikipedia.org/wiki/Esquema_de_Horner with the problem the timestamp is:
<!-- Saved in parser cache with key ptwiki:pcache:idhash:3991620-0!canonical and timestamp 20230727175816 and revision id 63128952. Rendering was triggered because: page-view
-->
I suppose the timestamp 20230727175816 means the page was rendered on 27 July 2023.
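
For reference, the timestamp is in MediaWiki's 14-digit YYYYMMDDHHMMSS format (UTC). A small sketch for pulling it out of the parser cache comment and decoding it; the function names are made up for illustration, and the regular expression assumes the comment wording shown above.

```javascript
// Decode a 14-digit MediaWiki timestamp (YYYYMMDDHHMMSS, UTC) into a Date.
function parseMwTimestamp(ts) {
	const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
	if (!m) {
		throw new Error('Not a 14-digit timestamp: ' + ts);
	}
	const [, year, month, day, hour, minute, second] = m.map(Number);
	// Date.UTC expects a zero-based month.
	return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Find the parser cache timestamp in a page's HTML source; returns null if absent.
function parserCacheTimestamp(html) {
	const m = /Saved in parser cache with key \S+ and timestamp (\d{14})/.exec(html);
	return m ? parseMwTimestamp(m[1]) : null;
}

console.log(parseMwTimestamp('20230727175816').toISOString());
// prints 2023-07-27T17:58:16.000Z
```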

Rate of errors on the English Wikipedia seems to have gone back to normal now.

As nothing was changed in the Math extension code, this suggests that the problems are related to the underlying infrastructure. I therefore suggest closing this issue and reopening it if required.

Physikerwelt claimed this task.

I want to point out that the problem is still present in many Wikimedia projects, there are currently 105 pages found in de.Wikipedia:
https://de.wikipedia.org/w/index.php?search=%22Math+extension+cannot+connect+to+Restbase.%22&title=Spezial%3ASuche&ns0=1

The problem appears outside of Wikimedia projects as well: The page
https://math.fandom.com/zh/wiki/Lp_%E7%A9%BA%E9%97%B4
on math.fandom.com currently consists of over 80% error messages in red, which demonstrates the extent of the problem. The MediaWiki version used by fandom.com is 1.39.4 according to https://community.fandom.com/wiki/Special:Version.

I don't know whether the problems on fandom and Wikimedia projects have the same origin. If yes, the problem was present in MediaWiki 1.39 too.

I want to point out that the problem is still present in many Wikimedia projects, there are currently 105 pages found in de.Wikipedia:
https://de.wikipedia.org/w/index.php?search=%22Math+extension+cannot+connect+to+Restbase.%22&title=Spezial%3ASuche&ns0=1

Are any of these current? I checked https://de.wikipedia.org/wiki/Gleichung_x%CA%B8_%3D_y%CB%A3 and could not find problems.

The problem appears outside of Wikimedia projects as well: The page
https://math.fandom.com/zh/wiki/Lp_%E7%A9%BA%E9%97%B4
on math.fandom.com currently consists of over 80% error messages in red, which demonstrates the extent of the problem. The MediaWiki version used by fandom.com is 1.39.4 according to https://community.fandom.com/wiki/Special:Version.

I don't know whether the problems on fandom and Wikimedia projects have the same origin. If yes, the problem was present in MediaWiki 1.39 too.

https://math.fandom.com/zh/wiki/Lp_%E7%A9%BA%E9%97%B4?action=render looks like this after adding action=purge to the URL.

Screenshot 2023-08-26 at 08-11-47 https __math.fandom.com.png (4×4 px, 1 MB)

There is a fairly constant error rate on the English Wikipedia, I would estimate about 1 or 2 a day. I spend a few minutes a day clearing out
https://en.wikipedia.org/wiki/Category:Articles_with_math_errors

The errors will arise even without anyone editing the page, as the whole page gets re-rendered periodically, but I don't know how often that happens. This also means errors will go away eventually.

I don't think the errors have anything to do with the code base; they look more like problems with the availability of the server.

I've been thinking of a script to clear out these pages automatically. I currently use the following bit of JavaScript that adds a couple of links to make purging and null edits easier.

// Add a "Null edit" link to the page actions menu.
$( mw.util.addPortletLink( 'p-cactions', '#', 'Null edit' ) ).on( 'click', function ( e ) {
	e.preventDefault();
	// Save the current content back unchanged, which forces a re-render.
	new mw.Api().edit( mw.config.get( 'wgPageName' ), function ( rev ) {
		return rev.content;
	} ).then( function () {
		window.location.reload();
	} );
} );

// Add a "mathpurge" link after the purge link.
// mw.util.getUrl() takes care of encoding the page title in the URL.
var pgname = mw.config.get( 'wgPageName' );
var purgeUrl = mw.util.getUrl( pgname, { action: 'purge', mathpurge: 'true' } );
var plink = $( '<a>' ).attr( 'href', purgeUrl ).text( 'mathpurge' );
var mpurge = $( '<li>' ).addClass( 'mw-list-item' ).append( plink );
$( '#ca-purge' ).after( mpurge );

Potentially we could write a bot to clear out these errors. We have bot permissions on many wikis.
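
A rough sketch of what the core of such a bot could look like, using the Action API's purge module with a categorymembers generator. The function names are hypothetical, the api argument is assumed to be an mw.Api-like object with a post() method, and continuation, rate limits and error handling are left out. Whether a plain API purge clears every RESTBase error is also an assumption based on the manual purge described in this task.

```javascript
// Build the request parameters for purging the members of a tracking category.
function buildPurgeParams( category, limit ) {
	return {
		action: 'purge',
		generator: 'categorymembers',
		gcmtitle: category,
		gcmlimit: limit || 20,
		// Also refresh the links tables so the page can drop out of the category.
		forcelinkupdate: 1,
		format: 'json'
	};
}

// Hypothetical wrapper: api is anything with a post( params ) method, e.g. new mw.Api().
function purgeCategory( api, category, limit ) {
	return api.post( buildPurgeParams( category, limit ) );
}

// Possible usage in a user script:
// purgeCategory( new mw.Api(), 'Category:Articles with math errors' );
```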

A quick way to clear the entries in the tracking categories is to use AWB: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser
Set it to read from Kategorie:Wikipedia:Seite mit Math-Fehlern, to make no changes to the page, and never to skip. This makes a null edit to each page, which also seems to force a re-rendering that clears out the RESTBase errors. These edits will not appear in your contributions. You do need to apply for permission to use AWB on the wiki where you wish to make edits.

The error rate is still very high. Today 293 new errors appeared. It has been averaging about 80 a day on the English Wikipedia for the last few weeks, up considerably from the 10/day rate back in July.

If I look in logstash, I see quite a few "Rendering failed. Printing error message." warnings, but not nearly as many underlying errors that would be the cause of hitting that.

Perhaps we should have
https://gerrit.wikimedia.org/g/mediawiki/extensions/Math/+/d50d3edaa85078cf8107e8d134b13673b8624745/src/MathRestbaseInterface.php#388
also for this block?
https://gerrit.wikimedia.org/g/mediawiki/extensions/Math/+/d50d3edaa85078cf8107e8d134b13673b8624745/src/MathRestbaseInterface.php#438

Another pretty common one is "Cannot get mml. Server problem.", but this log line also doesn't seem to have any context that can be used to figure out the problem with the server. The best way forward is probably to ensure that context gets logged to logstash. If we know the response code, which server was involved, and so on, then maybe we can narrow down the cause.

I can't access the logs, however, if you are willing to analyze the log files I am happy to add more context to the error messages.

Just wanted to add an external voice to this task - we (at Weird Gloop) are observing "Math extension cannot connect to Restbase" on a regular basis on our hosted wikis (example link). $wgMathFullRestbaseURL hasn't been changed, so we're using Wikimedia's Restbase API for our rendering. We're considering running our own Restbase server if the instability seems to be related to WMF's infrastructure, but if there's a core issue with Restbase/the Math extension then this wouldn't do much for us. We'd also not really like to support Restbase going into the future, given that WMF is deprecating it.

Unfortunately, there's no obvious pattern to the issues, and we have no extra context/exception details that can be provided to help.

For what it's worth, there's plenty of error logging indicating that Math has trouble contacting RESTBase: https://logstash.wikimedia.org/goto/39fabf99bc87240c85260d60bbf85a15

…and that RESTBase has trouble contacting Mathoid: https://logstash.wikimedia.org/goto/a51e6c9a887ffb889ea2e259c78cea31

The most common errors look like these:

  • upstream connect error or disconnect/reset before headers. reset reason: connection failure
  • upstream connect error or disconnect/reset before headers. reset reason: connection termination

I don't really know what to make of this, but maybe someone will…

I'm now seeing the error rate on the English Wikipedia drop to normal levels again. Unless someone else has cleared them out, we have no pages added to https://en.wikipedia.org/wiki/Category:Articles_with_math_errors in the last 24 hours.

Physikerwelt claimed this task.

@SalixAlba strange, we only just improved the error logging. Maybe the RESTBase server is more stable now. I will close this; please reopen (or create a new task) if new errors come up.

I'm finding it hard to believe, as the rates of errors I linked in T343648#9241155 have not changed.

I have no access to the logs; however, it looks like these are network errors. This theory is supported by T348973, which in theory should not be possible: it indicates that the response was not generated by any code path that is supposed to generate the response. By looking at the texvc input that caused the errors, it should be possible to get a better feeling for whether there is a statistical correlation between input and error.

@daniel do you think it is possible that the number of retries has changed when we migrated from the virtual RESTBase client to calling the underlying Guzzle library directly (T334842)?

Yes, that is possible. I'll see if I can find out.

EDIT: I am not seeing any retry logic at all. Not in the old VirtualRESTClient, not in the generic GuzzleHttpRequest, and not in the Math extension. Can you point me to the place where retries should be applied?
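
For illustration only, this is the kind of retry logic being asked about. It is a generic sketch in JavaScript, not code from the Math extension, VirtualRESTClient or Guzzle; the function name, the number of attempts and the delays are all made up.

```javascript
// Call request() up to `attempts` times, waiting longer after each failure.
// request is any function returning a promise; the last error is rethrown if all attempts fail.
async function withRetries( request, attempts = 3, delayMs = 200 ) {
	let lastError;
	for ( let i = 0; i < attempts; i++ ) {
		try {
			return await request();
		} catch ( err ) {
			lastError = err;
			if ( i < attempts - 1 ) {
				// Exponential backoff: delayMs, 2 * delayMs, 4 * delayMs, ...
				await new Promise( ( resolve ) => setTimeout( resolve, delayMs * 2 ** i ) );
			}
		}
	}
	throw lastError;
}
```

With something like this in place, a single "upstream connect error" would not immediately end up as an error message in the cached page.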

The error handling in MathRestbaseInterface::evaluateRestbaseCheckresponse looks like this:

if ( isset( $json->detail ) && isset( $json->detail->success ) ) {
	$this->success = $json->detail->success;
	$this->error = $json->detail;
} else {
	$this->success = false;
	$this->setErrorMessage( 'Math extension cannot connect to Restbase.' );
}

Since the error message we are seeing is from the else branch, I assume we fail to decode the response body as JSON. So it's probably HTML coming from some intermediate layer, probably with status 504 or 503. It would be useful to log at least the status, and possibly also the first 1000 bytes or so of the response body.
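
The suggested logging could look roughly like the following. This is a JavaScript sketch of the idea, not the PHP that ended up in the extension; the function name, the log callback and the shape of the logged context are assumptions for illustration.

```javascript
// Mirror of the check above, with extra context logged when the body is not the expected JSON.
// status: HTTP status code; body: raw response body; log: injectable logger (defaults to console.warn).
function evaluateCheckResponse( status, body, log = console.warn ) {
	let json = null;
	try {
		json = JSON.parse( body );
	} catch ( e ) {
		// Not JSON, e.g. an HTML error page from an intermediate layer.
	}
	if ( json && json.detail && json.detail.success != null ) {
		return { success: json.detail.success, error: json.detail };
	}
	log( 'Unexpected RESTBase response', {
		status: status,
		// Keep only the start of the body so the log entry stays small.
		bodyStart: String( body ).slice( 0, 1000 )
	} );
	return { success: false, error: 'Math extension cannot connect to Restbase.' };
}
```

Called with a 503 status and an HTML body, this returns the familiar error but also leaves the status and the start of the body in the log.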

Change 968641 had a related patch set uploaded (by Physikerwelt; author: Physikerwelt):

[mediawiki/extensions/Math@master] Log non json restbase responses

https://gerrit.wikimedia.org/r/968641

Change 968641 merged by jenkins-bot:

[mediawiki/extensions/Math@master] Log non json restbase responses

https://gerrit.wikimedia.org/r/968641

Physikerwelt closed this task as Resolved. Edited Nov 7 2023, 2:26 PM

I checked https://en.wikipedia.org/wiki/Category:Articles_with_math_errors and all the found instances render fine now.

This comment was removed by TheDJ.

It's nearly clear; https://en.wikipedia.org/wiki/Barnes_G-function and https://en.wikipedia.org/wiki/Computational_anatomy are still reporting "Math extension cannot connect to Restbase."

So at this point in time 2 actual errors and 14 false positives.

It's down to reasonable background levels, so I think we might just need to accept a small error rate. These do tend to clear themselves after a week or so.

It's nearly clear; https://en.wikipedia.org/wiki/Barnes_G-function and https://en.wikipedia.org/wiki/Computational_anatomy are still reporting "Math extension cannot connect to Restbase."

I just fixed the two by removing whitespace from the respective expressions.