Page MenuHomePhabricator

[[officewiki:Contact list]] fails to load in VE since it receives a HTTP 504 from the server
Closed, DuplicatePublicBUG REPORT

Description

Steps to reproduce

  1. Login and visit https://office.wikimedia.org/wiki/Contact_list on the desktop site
  2. Click Edit Source
  3. Click Visual Editing

Expected results

  • Page loads to edit Contact List and allows user to edit in visual editor

Actual results

"The server did not respond within the expected time." error message, with a progress bar around 70% complete in the background.

Environments observed

  • Browser version: 10.13.6
  • OS version: High Siera
  • Device model:
  • Device language: English

Check any additional observations

  • Observed while logged in

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@ppelberg I assigned it to you for you to prioritize.

JTannerWMF lowered the priority of this task from Unbreak Now! to High.Feb 7 2020, 9:54 PM
JTannerWMF raised the priority of this task from High to Unbreak Now!.
Jdforrester-WMF renamed this task from [Bug] Server did not respond within the expected time message popups when using Visual Editor on Office Wiki to Parsoid times out when using VisualEditor to load the Contact list on OfficeWiki .Feb 7 2020, 9:59 PM
Jdforrester-WMF added a project: Parsoid.
ssastry lowered the priority of this task from Unbreak Now! to Medium.Feb 7 2020, 10:07 PM

Making a source edit and clicking 'show preview' shows up the preview in under 30s. This preview also uses Parsoid. Maybe the preview doesn't use the pagebundle endpoint, so, it is possible this is an issue with the pagebundle endpoint. OR some other code that runs in the VE extension after Parsoid HTML is fetched. To be investigated.

I tried following https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Request_profiling to debug the problem (or at least identify if VE's API code or Parsoid is to blame), but every single page on https://performance.wikimedia.org/xhgui/ returns a "Slim Application Error" for me, and "inline profile" doesn't appear for API requests.

@matmarex Be sure to have "Inline profile" off if "XHGui" is on (known issue), and to click on the timetamp in the list viewer to open the profile details (clicking on the url is more intuitive but actually submits a search query which we have disabled currently in an ugly way, this will be fixed with T180761).

Huh, okay, thanks @Krinkle. That works. So I got this: https://performance.wikimedia.org/xhgui/run/view?id=5e3df76dbb8544953953b502

What I see here:

  • Request took ~19 seconds
  • Out of this, ~18.8 seconds is spent in ApiVisualEditor::requestRestbase/MultiHttpClient::runMultiCurl, which is where we fetch stuff from Parsoid
  • ApiVisualEditor::requestRestbase is only called 1 time, as expected
  • (MultiHttpClient::runMultiCurl is called 2 times? This seems to happen inside VirtualRESTServiceClient. Seems weird to me but maybe this is expected.)
  • Clearly, Parsoid is slow

@ppelberg I assigned it to you for you to prioritize.

Great spot, Jazmin. Thank you for filing and pinging. I've tagged this ticket with our tracking tag considering this seems to be a Parsing issue.

As for the priority, I agree with the point @matmarex raised in chat: as a first step, it seems important to spend 1-2 hours within the next week investigating what might be causing Parsoid's performance on the page to worsen.

I do not think this qualifies as UBN considering how limited the breakage seems to be [2].

@ssastry, does "...spend[ing] an hour or two within the next week..." sound appropriate to you?


  1. I assume the performance started to suffer significantly around 17-January, after which point the source mode was used for all edits to the page, save for one.
  2. Limited to the Office Wiki Staff page. Larger pages seems to load without issue. See: Community_Relations_Specialists/Audiences/Meetings_with_Product (162,835 bytes)

The issue was interesting so I was still looking into it :D

Apparently the slowness is a result of the huge number of images on that page.

I made this edit: https://office.wikimedia.org/w/index.php?title=Template:Contact_info&diff=prev&oldid=264698&diffmode=source to remove all the icon images for the contact methods and add them via TemplateStyles instead, and as a result, the page now loads for editing for me in ~11s instead of ~20s. (This also improved the PHP Parser speed from 4s to 2.8s.) It's still slow because all the photos remain, and of course I can't remove those, but hopefully that's enough to stop it from timing out.

The images were re-used hundreds of times, and ideally Parsoid would do… whatever-it-does with them that is so slow once once per image instead of once per use, but that might not be possible here because the images were inside templates.

Since we now know the cause, I think we can deprioritize this task, or outright decline it.

CDanis raised the priority of this task from Medium to Needs Triage.Feb 8 2020, 1:26 AM
CDanis set Security to Software security bug.
CDanis changed the visibility from "Public (No Login Required)" to "Custom Policy".
CDanis changed the subtype of this task from "Bug Report" to "Security Issue".
CDanis added a subscriber: CDanis.

Protecting as security issue due to private phone numbers being included in the screenshot

[OT] May I delete the screenshot, replace "Actual results" by text, and make this task public again? (Plus might be rather WMF-NDA than Security).

Yes, please. I tagged as security just for expediency's sake. (It would
be nice to have other such buttons easily accessible in Phab's sidebar.)

Aklapper changed the subtype of this task from "Security Issue" to "Bug Report".Feb 8 2020, 10:54 AM
Aklapper removed projects: Security-Team, acl*security.
Aklapper updated the task description. (Show Details)
Aklapper changed the subtype of this task from "Bug Report" to "Task".
Aklapper changed the visibility from "Custom Policy" to "Public (No Login Required)".
Aklapper changed the subtype of this task from "Task" to "Bug Report".

(Garr, sorry for the noise. The "Bug Report" subtype does not display a "Visible to" field in its Edit mode so had to go to "Task" first...)

Thanks everyone involved in filing, triaging, discussing, and cleaning up.

Thanks @matmarex for the investigation and the wikitext fix (which seems to improve perf for both parsers). But, we are still interested in looking at why Parsoid is losing perf there. AFAIK, we have a (single) batch API call for all images encountered on the page. However, this batch API call probably does not de-duplicate images and is probably causing duplicate lookups on the database end (even if it is batched somehow in core). So, that is a potential perf opt we should definitely do on our end. So, we'll talk about it and retarget this phab task for that work.

ssastry triaged this task as Medium priority.Feb 9 2020, 7:58 PM

I've responded to the XHGui issue, but other than that looks like VE and Parsoid are handling this. Let us know if there's anything you'd like our help with.

ssastry raised the priority of this task from Medium to High.Mar 24 2020, 4:26 PM

It is odd that when I click the Edit link, I get the 504 pretty quickly (< 15 s). And, I turned on XHGui in my Wikimedia-Debug extension and issued a request to the MediaWiki API and as https://performance.wikimedia.org/xhgui/run/view?id=5e7a3f3b3f3dfa0b0a8a3788 shows that request completed in ~8s.

So, we need to figure out what exactly is going on in the VE (JS client) -> VE (PHP extension) -> Parsoid chain and response processing back up that is causing the 504 back to VE (client). I didn't see anything in logstash that illuminates this.

So, I can consistently parse this page on scandium in ~10s (see below for 2 instances) and I can confirm that the output is legit.

ssastry@scandium:~$ time sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php /srv/parsoid-testing/bin/parse.php --wiki=officewiki --integrated --pageName "Contact_list" < /dev/null > /tmp/contact_list.html

real	0m10.055s
user	0m8.940s
sys	0m0.244s
ssastry@scandium:~$ time sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php /srv/parsoid-testing/bin/parse.php --wiki=officewiki --integrated --pageName "Contact_list" < /dev/null > /tmp/contact_list.html

real	0m10.144s
user	0m8.932s
sys	0m0.264s

So, I don't think this is a Parsoid issue. @matmarex can you check if VE (PHP extension code) that processes Parsoid HTML before passing it along to VE (JS front-end) is failing on this output?

ssastry renamed this task from Parsoid times out when using VisualEditor to load the Contact list on OfficeWiki to [[officewiki:Contact list]] fails to load in VE since it receives a HTTP 504 from the server.Apr 6 2020, 10:19 PM

Loading this page in the source editor works. But, when I make an edit and preview the page, I get this:

Preview of edit summary:
No edit summary
[object Object]

So, that same error that causes the load in VE to fail also causes the preview to fail.

ssastry added a subscriber: MBinder_WMF.

@JTannerWMF This needs VE investigation.

On second thoughts, I suspect this may actually be a performance issue with the PEG tokenizer in Parsoid.

Change 699425 had a related patch set uploaded (by Arlolra; author: Arlolra):

[operations/puppet@production] Bump envoy timeout for parsoid-php

https://gerrit.wikimedia.org/r/699425

RE T244609#7152826, ordinarily we'd close T279825 as a duplicate of T244609 (this ticket), considering T244609 was filed first.

Although, because of the valuable context in T279825#7151307, I'm making an exception to this convention.

Change 700077 had a related patch set uploaded (by Arlolra; author: Arlolra):

[operations/mediawiki-config@master] Switch to using parsoid-async for direct VirtualRestClient connects

https://gerrit.wikimedia.org/r/700077

Change 699425 merged by Legoktm:

[operations/puppet@production] Bump envoy timeout for parsoid-php

https://gerrit.wikimedia.org/r/699425

Change 700077 abandoned by Arlolra:

[operations/mediawiki-config@master] Switch to using parsoid-async for direct VirtualRestClient connects

Reason:

https://gerrit.wikimedia.org/r/700077

We saw some complaints of this again today on Slack, or at least a presumably very similar one where it's erroring with "Error contacting the Parsoid/RESTBase server: (curl error: 28) Timeout was reached". Couldn't reproduce it myself, though I was trying a few hours later, so it might have been some intermittent issue.

Peter closed this as duplicate of T279825, where the following may be related: T279825#7175189

The backend takes about 12 seconds. I used

time curl -w\\n -H'X-Forwarded-Proto: https' -H'Cookie: ...'  -x wtp1038.eqiad.wmnet:80 http://office.wikimedia.org/w/rest.php/office.wikimedia.org/v3/transform/wikitext/to/html/Contact%20list/299182 -F wikitext=@contact.txt -o /dev/null

which I think is pretty close to what VE does. Not sure if there's a log of these requests. Apparently the http channel goes to /dev/null.