VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki
Closed, Resolved (Public)

Description

"Error contacting the Parsoid/RESTBase server (HTTP 404)" in Flow and "Error contacting the Parsoid/RESTBase server (HTTP 411)" in the visual editor.

It appears that anything that uses the visual editor or Flow (which is most of the wiki) can't be edited right now.

Event Timeline

Bugreporter triaged this task as Unbreak Now! priority. Apr 6 2020, 6:54 PM

Is this a new issue? We have a known issue w/ wikitech, but my recollection is that VE "should" work on officewiki, although there were quirks in how restbase was handled.

officewiki is group0, so it's still running 0.12.0-a8, and has been since last Tuesday. <s>I'm guessing officewiki didn't "just" break, but has been broken for a # of weeks. </s> (EDIT: I was wrong!) If it really was working recently, that would be good to know to help narrow down the cause.

Framawiki renamed this task from Office.wiki is mostly broken to VE and Flow fails with "Error contacting the Parsoid/RESTBase server (HTTP 404)" on officewiki. Apr 6 2020, 7:14 PM
Framawiki added a project: SRE.
Framawiki renamed this task from VE and Flow fails with "Error contacting the Parsoid/RESTBase server (HTTP 404)" on officewiki to VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" on officewiki. Apr 6 2020, 7:14 PM
matmarex renamed this task from VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" on officewiki to VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki. Apr 6 2020, 7:23 PM

Change 586423 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] Revert "parsoid: switch to envoy for TLS termination"

https://gerrit.wikimedia.org/r/586423

Change 586423 merged by Giuseppe Lavagetto:
[operations/puppet@production] Revert "parsoid: switch to envoy for TLS termination"

https://gerrit.wikimedia.org/r/586423

Note I tag it as Unbreak Now because of T249533: testwikidata wiki is broken with "Cannot access the database" which may or may not be related.

I don't believe this is related. The envoy change should only have affected private wikis, i.e. wikis which don't use RESTBase:
https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L17274

The change at fault appears to be 8e9f967d543721b16ee51fc3772976c8963440ae, which I flagged as possibly suspicious based only on timing.

From IRC:

[15:14:58] <cscott> _joe_: any chance you caused VE to break on officewiki?
[15:15:22] <cscott> cf T247389 and T249535
[15:15:23] <stashbot> T247389: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389
[15:15:23] <stashbot> T249535: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" on officewiki - https://phabricator.wikimedia.org/T249535
[15:15:27] <_joe_> cscott: uh it seems not very probable
[15:15:43] <cscott> it's the only conf change that has happened on the parsoid cluster today, as far as i can tell
[15:15:50] <cscott> i'm just looking at the timing
[15:15:58] <_joe_> is that just officewiki or all wikis?
[15:16:46] <cscott> just officewiki afaik
[15:17:00] <_joe_> that's very unprobable then
[15:17:15] <_joe_> unless we somehow closed a loophole we didn't know of
[15:17:29] <cscott> that "unless" is what i'm looking at
[15:17:26] <MatmaRex> cscott: _joe_: VE is getting the "Wikimedia Error" pages as a response from Parsoid. this seems like it might be important
[15:17:39] <_joe_> MatmaRex: on any wiki?
[15:17:53] <MatmaRex> i don't know, but definitely on officewiki
[15:18:15] <_joe_> can someone test another wiki please? if it's just officewiki this might not be related to my change
[15:18:27] <MatmaRex> check the "response" field here: https://logstash-next.wikimedia.org/app/kibana#/doc/2d891220-161a-11ea-a364-c747e6d6cfc2/logstash-mediawiki-2020.04.06?id=rRrgUHEBxWmzajXK0QU1
[15:18:30] <_joe_> anyways, lemme get to my computer
[15:18:31] <cscott> enwiki is fine.
[15:19:46] <cscott> en.beta is fine
[15:19:53] <_joe_> <div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class='text-muted'><code>Request from - via cp1081.eqiad.wmnet, ATS/8.0.6<br>Error: 411, Content Length Required at 2020-04-06 19:04:02 GMT</code></p></div>
[15:19:57] <_joe_> oh
[15:20:10] <MatmaRex> in other cases, we're getting: {"messageTranslations":{"en":"The requested relative path (/office.wikimedia.org/v3/page/html/Homepage/268268) did not match any known handler"},"httpCode":404,"httpReason":"Not Found"}
[15:20:11] <_joe_> sigh, yes, we must have something that doesn't set the content-length header
[15:20:27] <MatmaRex> https://logstash-next.wikimedia.org/app/kibana#/doc/2d891220-161a-11ea-a364-c747e6d6cfc2/logstash-mediawiki-2020.04.06?id=CBngUHEBxWmzajXKcePI
[15:20:29] <_joe_> MatmaRex: that seems restbase though
[15:20:37] <cscott> i think officewiki is special in that it either bypasses restbase or does some other special non-caching thing because it's a private wiki
[15:20:39] <MatmaRex> yeah. there seem to be two problems
[15:20:57] <cscott> i forget exactly how it is set up, but it is definitely a unique and special flower
[15:21:12] <cscott> well, as unique and special as any of the private wikis are
[15:21:41] <_joe_> yes, something like that
[15:21:41] <MatmaRex> _joe_: VE code that sends these requests does not set the 'content-length' header explicitly. should it do that? (or is it supposed to happen automatically somewhere?)
[15:22:07] <_joe_> MatmaRex: I guess it should happen automatically, but let me revert first, ask questions later
[15:22:21] <_joe_> the 404 might not be related to my switch, but the 411 surely is
[15:24:20] <cscott> Pchelolo confirms no restbase for private wikis
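
The 411 in the log above is what a proxy may return when a request that normally carries a body declares neither Content-Length nor Transfer-Encoding. The thread does not say which component dropped the header, so the following is only a minimal Python sketch of a client-side guard. `ensure_content_length` is a hypothetical helper for illustration, not part of VE, MediaWiki, or envoy.

```python
def ensure_content_length(method, headers, body=b""):
    """Return a copy of `headers` with Content-Length added when it is needed.

    Hypothetical helper for illustration only.
    """
    out = dict(headers)
    names = {name.lower() for name in out}
    # If the body is already framed (explicit length or chunked), leave it alone.
    if "content-length" in names or "transfer-encoding" in names:
        return out
    # Methods that normally carry a body declare its length, even when it is zero.
    if method.upper() in {"POST", "PUT", "PATCH"}:
        out["Content-Length"] = str(len(body))
    return out
```

With this guard, an empty POST would go out with `Content-Length: 0` instead of no length at all, which is the case a strict proxy may reject.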

Same for VE on otrswiki, another private wiki. https://otrs-wiki.wikimedia.org/w/api.php?action=visualeditor&format=json&paction=parse&page=User%3AFramawiki%2Fmessages&uselang=fr&formatversion=2 fails with

{"error":{"code":"apierror-visualeditor-docserver-http","info":"Error contacting the Parsoid/RESTBase server (HTTP 503)","docref":"See https://otrs-wiki.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes."},"servedby":"mw1347"}
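
Since this task involves three different upstream codes (404, 411, 503), it can help to pull the code out of the API error when comparing wikis. A small Python sketch, assuming the `info` string keeps the "(HTTP NNN)" shape shown above. `upstream_http_status` is a hypothetical helper, and the sample is a shortened copy of the response above.

```python
import json
import re


def upstream_http_status(api_response_text):
    """Extract the upstream HTTP status from a VisualEditor API error, if any."""
    payload = json.loads(api_response_text)
    info = payload.get("error", {}).get("info", "")
    match = re.search(r"\(HTTP (\d{3})\)", info)
    return int(match.group(1)) if match else None


# Shortened copy of the otrswiki response quoted above.
sample = (
    '{"error":{"code":"apierror-visualeditor-docserver-http",'
    '"info":"Error contacting the Parsoid/RESTBase server (HTTP 503)"},'
    '"servedby":"mw1347"}'
)
print(upstream_http_status(sample))
```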

The rollback of today's change is almost completed. I will take the time tomorrow to try to understand what happened. Marking as resolved as I can now consistently edit officewiki. If the problem happens again, then it's not related to my change.

I am going to get to the bottom of this today. My plan of action is:

  • convert one parsoid server back to envoy, depooling it from traffic
  • enable all the possible debug output from the http filter
  • make mwdebug1002 use that server alone. This can be done by hacking the envoy configuration for parsoid-php to point to that server directly.

Ok, I found the culprit:

  • private wikis set the cookie forceHTTPS: true
  • We proxy to parsoid-php via http://localhost:6002
  • envoy thus sees the upstream connection as http, and thus adds the X-Forwarded-Proto: http header to the request
  • Parsoid-php, being MediaWiki, sees the header and thus sends back a redirect to the https version of the same URL
  • This gets sent back to MediaWiki, which follows the redirect to the public URL (e.g. office.wikimedia.org). The edge caches route that request to where we serve /w/rest.php, that is, the appserver cluster
  • The appserver cluster doesn't have parsoid installed, so returns a 404.
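
The chain above can be modelled in a few lines of Python. This is a toy model of the described behaviour, not actual envoy or MediaWiki code. All function names are hypothetical, and `force_https` stands in for the private-wiki forceHTTPS setting.

```python
def envoy_forward(headers, listener_scheme):
    """Model envoy stamping X-Forwarded-Proto with the scheme it was contacted on."""
    out = dict(headers)
    out["X-Forwarded-Proto"] = listener_scheme
    return out


def parsoid_decision(headers, force_https):
    """Model Parsoid-php (being MediaWiki) redirecting plain-http requests."""
    if force_https and headers.get("X-Forwarded-Proto") == "http":
        return "redirect-to-https"
    return "ok"


def trace(listener_scheme, force_https):
    """Follow one MediaWiki -> envoy -> Parsoid request to its outcome."""
    headers = envoy_forward({}, listener_scheme)
    if parsoid_decision(headers, force_https) == "ok":
        return "parsoid cluster answers"
    # MediaWiki follows the redirect to the public URL; the edge routes
    # /w/rest.php to the appserver cluster, which has no Parsoid installed.
    return "appserver cluster returns 404"
```

In this model only the combination of a plain-http hop to localhost and `force_https=True` ends in the 404, which matches the observation that public wikis such as enwiki were unaffected.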

Possible solutions:

  • Use TLS on localhost to talk to envoy
  • Forward the X-Forwarded-Proto header from the client request in the MediaWiki http request
  • Add X-Forwarded-Proto: https at the envoy layer

I think the right solution is the second, and I will create a task to that end, but for now I will implement the last option.
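
The difference between the second and third options can be sketched in Python as follows. This is an illustration under stated assumptions, not the puppet or envoy change itself. `outgoing_headers` and its `fixed_proto` parameter are hypothetical names.

```python
def outgoing_headers(client_headers, fixed_proto=None):
    """Build the X-Forwarded-Proto header for the MediaWiki -> Parsoid request.

    fixed_proto="https" models the third option: the proxy layer always adds
    X-Forwarded-Proto: https. fixed_proto=None models the second option:
    carry forward whatever the original client request had.
    """
    out = {}
    if fixed_proto is not None:
        out["X-Forwarded-Proto"] = fixed_proto
    elif "X-Forwarded-Proto" in client_headers:
        out["X-Forwarded-Proto"] = client_headers["X-Forwarded-Proto"]
    return out
```

The fixed value is simpler to deploy because it needs no MediaWiki change, while forwarding preserves the real scheme of the original client request.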

Change 587227 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::services_proxy: allow adding XFP header, enable on parsoid/restbase

https://gerrit.wikimedia.org/r/587227

There is also the solution of unsetting forceHTTPS: true when making the request from MediaWiki to parsoid. Not sure how much I like that though. I too think that carrying forward the X-Forwarded-Proto from the client request is preferable.

@Joe could you take a look at https://gerrit.wikimedia.org/r/579021 and subsequent patches as well? We plan on renaming the parsoid-php lookup key back to plain parsoid, and I just want to make sure this won't break anything you're doing here.

Change 587227 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::services_proxy: allow adding XFP header, enable on parsoid/restbase

https://gerrit.wikimedia.org/r/587227

Mentioned in SAL (#wikimedia-operations) [2020-04-08T05:33:48Z] <_joe_> repooling wtp1025, with envoy and logging any error above 404 T249535

Hi, I don't know if this is the right place to ask. Some people on the francophone Wikipedia are receiving an error message on Flow discussion pages:
"Erreur d’accès au serveur Parsoid/RESTBase (HTTP 400)"
This issue is also being discussed on Discord by confirmed contributors. What can be done?

Welcome to Wikimedia Phabricator. You might want to open a new task for that as this one specifically dealt with an issue on office wiki.

This is referenced as T249997: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400).