
ApiStashEdit silently loses custom DataUpdates.
Closed, ResolvedPublic

Description

Extensions can use ParserOutput::addSecondaryDataUpdate() to register a callback to be executed when saving the respective content. Typically this is used to track some aspects of the page content in extra database tables.

Since DataUpdate objects are generally not serializable, ParserOutput skips them during serialization (see ParserOutput::__sleep). This was no problem before, since the parser output would only go to the cache after it had been saved, and WikiPage would apply DataUpdates based on a "fresh" instance of ParserOutput.

ApiStashEdit changes this: now, the ParserOutput that WikiPage uses while saving the content may be coming from the cache. In that case, it would have silently lost any DataUpdates that an extension may have registered, effectively disabling the custom tracking of page content.
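The failure mode can be illustrated with a minimal Python sketch (MediaWiki itself is PHP; the class and method names here are hypothetical stand-ins for ParserOutput, and `__getstate__` plays the role of `ParserOutput::__sleep`):

```python
import pickle

class FakeParserOutput:
    """Toy stand-in for ParserOutput (hypothetical names)."""
    def __init__(self, html):
        self.html = html
        self.secondary_updates = []  # callbacks; typically not serializable

    def add_secondary_data_update(self, update):
        self.secondary_updates.append(update)

    def __getstate__(self):
        # Mirrors ParserOutput::__sleep skipping the update objects.
        state = self.__dict__.copy()
        state['secondary_updates'] = []
        return state

out = FakeParserOutput("<p>hi</p>")
out.add_secondary_data_update(lambda: None)  # extension-registered update

cached = pickle.loads(pickle.dumps(out))  # what the stash round-trip does
assert cached.html == out.html            # rendered content survives...
assert cached.secondary_updates == []     # ...but the updates are silently gone
```

Before stashing existed this loss was harmless, because the cached copy was never the one used to save the page.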


Event Timeline

daniel created this task. Jan 9 2015, 3:29 PM
daniel raised the priority of this task from to Needs Triage.
daniel triaged this task as High priority.
daniel updated the task description.
daniel added subscribers: aude, hoo, Lydia_Pintscher and 2 others.
daniel added a subscriber: daniel.

Bumping to high, since this breaks at least the WikibaseClient extension.

daniel added a comment. Jan 9 2015, 3:32 PM

The only fix I can think of offhand would be to check if there are custom updates, and if there are, don't stash the edit. That would only happen after burning CPU cycles for parsing though.

Note that this would effectively disable the use of ApiStashEdit for all wikis/namespaces that act as Wikibase clients - in particular, all Wikipedia articles.
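The proposed check amounts to refusing to stash whenever custom updates are attached, as in this Python sketch (all names invented for illustration):

```python
class StashDemoOutput:
    """Minimal stand-in for a parsed edit (hypothetical)."""
    def __init__(self, secondary_updates):
        self.secondary_updates = secondary_updates

def maybe_stash(parser_output, stash, key='stashedit:example'):
    """Refuse to stash output carrying custom DataUpdates, since the
    cached copy would lose them on serialization."""
    if parser_output.secondary_updates:
        return False  # the parse CPU is already spent, but stashing is unsafe
    stash[key] = parser_output
    return True

stash = {}
assert maybe_stash(StashDemoOutput([]), stash) is True           # safe: stashed
assert maybe_stash(StashDemoOutput([object()]), stash) is False  # skipped
```

As noted above, on a Wikibase client wiki every article parse registers such updates, so this check would disable stashing there entirely.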

Change 183861 had a related patch set uploaded (by Daniel Kinzler):
Skip ApiStashEdit if custom DataUpdates are present.

https://gerrit.wikimedia.org/r/183861

Patch-For-Review

Change 183861 merged by jenkins-bot:
Skip ApiStashEdit if custom DataUpdates are present.

https://gerrit.wikimedia.org/r/183861

Anomie added a comment. Jan 9 2015, 4:37 PM

Since DataUpdate objects are generally not serializable, ParserOutput skips them during serialization (see ParserOutput::__sleep).

I don't know about "are generally not serializable", but the reasoning given both in the comment and in Gerrit change 119342 is to avoid saving unneeded data.

The only fix I can think of offhand would be to check if there are custom updates, and if there are, don't stash the edit.

Or you could rework things so DataUpdate must be serializable, and have ApiStashEdit set a flag on the ParserOutput to tell it to preserve the update objects. The only user of ParserOutput::addSecondaryDataUpdate() in extensions in git seems to be Wikibase itself.

As for implementations of DataUpdate in core, at first glance they don't seem to be necessarily non-serializable. The DB connection in SqlDataUpdate could be excluded in __sleep() and reopened in __wakeup(), and similar things could be done for the WikiPage needed by LinksDeletionUpdate and the Title in LinksUpdate.
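The exclude-and-reopen pattern Anomie describes looks like this in a Python analogue, where `__getstate__`/`__setstate__` play the roles of `__sleep`/`__wakeup` and sqlite3 stands in for the real DB connection (class name hypothetical):

```python
import pickle
import sqlite3

class SqlDataUpdateSketch:
    """Sketch of a serializable SqlDataUpdate-style class."""
    def __init__(self, table):
        self.table = table
        self.conn = sqlite3.connect(':memory:')  # not picklable as-is

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['conn']  # exclude the connection on serialization
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conn = sqlite3.connect(':memory:')  # reopen on deserialization

u = pickle.loads(pickle.dumps(SqlDataUpdateSketch('wbc_entity_usage')))
assert u.table == 'wbc_entity_usage'  # plain state survives the round-trip
assert u.conn is not None             # fresh connection after "wakeup"
```

The same trick would apply to the WikiPage and Title members: store only the identifying data and re-fetch the object on wakeup.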

Change 183890 had a related patch set uploaded (by Daniel Kinzler):
Fix ApiStashEdit wrt custom DataUpdates.

https://gerrit.wikimedia.org/r/183890

Patch-For-Review

daniel added a comment. Jan 9 2015, 7:56 PM

@Anomie: adding the guarantee that DataUpdates must be serializable would make the mechanism a lot more complex (right now, it's an easy way to provide callbacks). I submitted a patch that uses a flag in ParserOutput - that could be extended to only trigger on non-serializable DataUpdates.

Change 183890 merged by jenkins-bot:
Fix ApiStashEdit wrt custom DataUpdates.

https://gerrit.wikimedia.org/r/183890

Change 183887 had a related patch set uploaded (by Aaron Schulz):
Made some DataUpdate classes serializable

https://gerrit.wikimedia.org/r/183887

Patch-For-Review

Change 183887 abandoned by Aaron Schulz:
Made some DataUpdate classes serializable

Reason:
Doing this in a cleaner way

https://gerrit.wikimedia.org/r/183887

Change 184463 had a related patch set uploaded (by Aaron Schulz):
Introduced LazyDataUpdate class

https://gerrit.wikimedia.org/r/184463

Patch-For-Review

aaron added a comment. Edited Jan 13 2015, 1:18 AM

@daniel: what do you think of https://gerrit.wikimedia.org/r/184463? Would it be hard to build off? I'd really like to get stashing on again, since it saves 300+ms on an average save.

@aaron I'm afraid that patch doesn't address the actual problem: yes, you can get around the serialization issue if you rely on global state. But if you do that, you don't need an extra wrapper thingy; just do it in the DataUpdate.

Note that Wikibase will no longer put DataUpdates into the ParserOutput once I93f09dc5b is merged, see T86308.

Maybe it's best to just remove the DataUpdates functionality from ParserOutput altogether. The LinksUpdate handling can go into AbstractContent::getSecondaryDataUpdates(). ParserOutput::getSecondaryDataUpdates() isn't used anywhere directly, as far as I can see.

If we do that, we may want to add a hook to AbstractContent::getSecondaryDataUpdates().

ori added a comment. Jan 13 2015, 5:38 PM

@daniel Can you explain why this work needs to be done synchronously, blocking the user?

daniel added a comment. Edited Jan 13 2015, 6:24 PM

@ori: simply because there is currently no mechanism (as far as I know) for extensions to provide deferred updates. Only some specific hard-coded updates are deferred; everything else is executed right away (see the calls to DeferredUpdates::addUpdate in WikiPage).

DataUpdate itself could somehow indicate whether it wants to be run:

  1. in the transaction that updates the page table
  2. immediately after updating the page table
  3. at the end of the web request
  4. some time later (job queue)

I'd support that feature request, but I'm not volunteering to implement it :)
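The four deferral stages daniel lists could be declared as a flag on the update object and dispatched accordingly, as in this Python sketch (all names hypothetical; this feature did not exist at the time):

```python
from enum import Enum

class RunStage(Enum):
    IN_PAGE_TRANSACTION = 1  # inside the transaction updating the page table
    AFTER_PAGE_UPDATE = 2    # immediately after updating the page table
    END_OF_REQUEST = 3       # deferred to the end of the web request
    JOB_QUEUE = 4            # pushed to the job queue for some time later

class DataUpdateSketch:
    # Current behaviour in effect: run right after the page update.
    run_stage = RunStage.AFTER_PAGE_UPDATE

def dispatch(update, run_now, deferred, job_queue):
    """Route an update to an execution bucket by its declared stage."""
    if update.run_stage in (RunStage.IN_PAGE_TRANSACTION,
                            RunStage.AFTER_PAGE_UPDATE):
        run_now.append(update)
    elif update.run_stage is RunStage.END_OF_REQUEST:
        deferred.append(update)
    else:
        job_queue.append(update)

run_now, deferred, job_queue = [], [], []
dispatch(DataUpdateSketch(), run_now, deferred, job_queue)
assert run_now and not deferred and not job_queue
```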

ori added a comment. Jan 13 2015, 6:24 PM

@daniel, thanks for clarifying. Could you also explain (in practical terms) what updates are being performed?

On the client wiki, we are tracking which page uses which data item by inserting rows into the wbc_entity_usage table. On the repo, we do similar tracking of which data item links to which client page (wb_items_per_site), which labels, aliases and descriptions are defined for an item (wb_terms), which property has which type (wb_property_info), etc. Basically, quite similar to what the link tables do in core.

Deferring the updates until the end of the request would probably be ok, but might prove problematic for tables used for uniqueness checks (currently, wb_terms and wb_items_per_site, though I hope to reduce this in the future, see T74430).

Change 184463 abandoned by Aaron Schulz:
Introduced LazyDataUpdate class

Reason:
Wikibase no longer uses addSecondaryDataUpdates()

https://gerrit.wikimedia.org/r/184463

JohnLewis moved this task from incoming to ready to go on the Wikidata board. Jan 24 2015, 1:17 PM
daniel closed this task as Resolved. Feb 19 2015, 4:40 PM
daniel claimed this task.

Wikibase no longer uses addSecondaryDataUpdates()