
Revision updates with Jobrunner for Parsoid and RESTBase
Closed, Invalid · Public


To stay up to date with content, Parsoid relies on MW hooks to notify it of any changes. This approach is known to cause problems: when a template transcluded in multiple pages is updated, all of those pages need to be regenerated as well, which can flood the MW API with requests.

Until T84923 is resolved, RESTBase has opted to use the same update mechanism to refresh the content in its storage. For each update, RESTBase places one call to the MW API (requesting revision info) and another to Parsoid to obtain the refreshed content.

The problem lies in the fact that all of the aforementioned requests are made with Cache-Control: no-cache headers, causing the following chain of events for each page that needs updating:

  1. Jobrunner requests Parsoid to generate the new revision's HTML
  2. Parsoid fetches the content from MW API and generates it
  3. Jobrunner requests RESTBase to get a fresh copy of the content as well as revision info
  4. RESTBase calls the MW API's revprop
  5. RESTBase calls Parsoid's pagebundle endpoint
  6. Parsoid fetches the content from MW API

Concretely, the problem is fetching the content twice from the MW API (steps 2 and 6).
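
The chain of events above can be sketched as a small simulation that counts MW API calls per page update. This is only an illustrative model, not the actual Jobrunner, Parsoid, or RESTBase code: every class and method name below (`MediaWikiAPI`, `Parsoid.render`, `RESTBase.update`, `run_update_jobs`) is a hypothetical stand-in, and the absence of caching is modelled simply by never reusing a previous result.

```python
from collections import Counter


class MediaWikiAPI:
    """Hypothetical stand-in for the MW API; it only counts incoming calls."""

    def __init__(self):
        self.calls = Counter()

    def fetch_content(self, title):
        self.calls["content"] += 1
        return f"wikitext of {title}"

    def fetch_revision_info(self, title):
        self.calls["revision_info"] += 1
        return {"title": title}


class Parsoid:
    """Hypothetical stand-in for Parsoid with no-cache semantics."""

    def __init__(self, mw):
        self.mw = mw

    def render(self, title):
        # Every request bypasses caches, so the content is always re-fetched.
        wikitext = self.mw.fetch_content(title)
        return f"<html>{wikitext}</html>"


class RESTBase:
    """Hypothetical stand-in for RESTBase's update path."""

    def __init__(self, mw, parsoid):
        self.mw = mw
        self.parsoid = parsoid
        self.storage = {}

    def update(self, title):
        info = self.mw.fetch_revision_info(title)  # step 4
        html = self.parsoid.render(title)          # steps 5 and 6
        self.storage[title] = (info, html)


def run_update_jobs(title, parsoid, restbase):
    """Model the two jobs triggered for one changed page."""
    parsoid.render(title)   # steps 1 and 2: Parsoid's own update job
    restbase.update(title)  # steps 3 to 6: RESTBase's update job


def simulate(title="Example"):
    mw = MediaWikiAPI()
    parsoid = Parsoid(mw)
    restbase = RESTBase(mw, parsoid)
    run_update_jobs(title, parsoid, restbase)
    return mw.calls


if __name__ == "__main__":
    print(dict(simulate()))
```

In this model a single page update results in two content fetches and one revision-info request against the MW API, which is the duplication described in steps 2 and 6.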

Since both update extensions monitor the same hooks and ultimately both update Parsoid's cache, the question is: can we deprecate Parsoid's extension and rely on the RESTBase one to update it, in order to minimise the impact on MW's API? That would probably need to involve some Varnish trickery, given that Parsoid's update extension uses the v1 API to refresh the content, while RESTBase relies on Parsoid v2 endpoints.

Event Timeline

mobrovac raised the priority of this task from to Needs Triage.
mobrovac updated the task description.
mobrovac added subscribers: mobrovac, GWicke, Eevans, Joe.
Restricted Application added a subscriber: Aklapper. Mar 12 2015, 12:40 PM
mobrovac triaged this task as High priority. Mar 12 2015, 12:43 PM
mobrovac set Security to None.
mobrovac edited subscribers, added: ssastry; removed: Aklapper.
mobrovac updated the task description. Mar 12 2015, 2:02 PM
GWicke added a comment. Edited Mar 13 2015, 5:59 AM

This doesn't look very accurate.

  1. The duplicate generation is temporary until the Parsoid v1 API can be retired, which is approximately right after VE is switched over (possibly next week). Moreover, the v2 API (JSON) responses are intentionally uncached to avoid diluting the v1 (HTML) response cache in the meantime.
  2. Parsoid already reuses template expansion and/or images from the previous version's HTML. While there is still a lot of potential for more reuse, this means that the number of API requests is actually not that large.

I also don't see any data showing that there is actually a performance problem. Here is a graph showing the Parsoid cluster load on the day we enabled full change tracking for all Wikipedias in RESTBase:

Can you spot the time of the switch?

GWicke closed this task as Invalid. Mar 13 2015, 6:01 AM
GWicke claimed this task.

I'm going ahead and closing this as invalid, as I think it's mostly based on a misunderstanding. Please reopen if you feel that there's something we need to address here.