
[Task] Profile entity deserialization
Closed, ResolvedPublic

Description

It seems like the full deserialization of Entities uses a lot of CPU and RAM. We should be able to pinpoint exactly where large amounts of resources are being used, so that we can implement the following improvements:

  • Have specialized deserializers generating specialized model objects.
  • Deferred deserialization. At some level, stop deserializing, remember the fragment of the array, and continue when asked for the deserialized value. (lazy load)
  • Just do not deserialize. Pass the JSON blob (or intermediate array structure) around.
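The deferred (lazy-load) idea can be sketched as follows. Wikibase itself is PHP, but a minimal Python illustration conveys the shape; `LazyEntity` is a hypothetical name, and the label structure mimics the Wikibase JSON serialization:

```python
import json

class LazyEntity:
    """Sketch of deferred deserialization: keep the decoded array and
    deserialize each fragment only on first access (lazy load)."""

    def __init__(self, raw):
        self._raw = raw          # intermediate array structure (decoded JSON)
        self._labels = None      # deserialized lazily on first access

    @property
    def labels(self):
        # Deserialize the labels fragment only when asked for it.
        if self._labels is None:
            self._labels = {
                lang: term["value"]
                for lang, term in self._raw.get("labels", {}).items()
            }
        return self._labels

blob = '{"labels": {"en": {"language": "en", "value": "Germany"}}}'
entity = LazyEntity(json.loads(blob))
print(entity.labels["en"])  # deserialization happens here, not at construction
```

A caller that never touches `labels` (e.g. one editing only statements) never pays for deserializing them.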

Event Timeline

daniel created this task.Feb 25 2015, 12:39 PM
daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel added a subscriber: daniel.
Restricted Application added a subscriber: Aklapper.Feb 25 2015, 12:39 PM

@daniel: Associating a project with this task would be highly welcome, so someone might find it when searching open tasks per project. :)

Lydia_Pintscher triaged this task as High priority.Mar 30 2015, 9:49 AM
Lydia_Pintscher set Security to None.
Lydia_Pintscher renamed this task from Benchmark deferred entity deserialization to [Task] Benchmark entity deserialization.Sep 8 2015, 2:22 PM
Lydia_Pintscher updated the task description.
Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Notes from story time:

Collecting options and wild ideas:

  • Profile, find hotspots, try to improve the existing architecture in small but relevant steps.
  • Have specialized deserializers generating specialized model objects.
  • Deferred deserialization. At some level, stop deserializing, remember the fragment of the array (typically: all statements, e.g. all aliases) and the relevant deserializers, and continue when asked for the deserialized value.

Lua needs to be adapted to make use of this. Currently a lot of code assumes it has a fully deserialized entity *OR* code just works on a JSON blob.
Special case: Terms (labels, descriptions, aliases) are objects. Hundreds of thousands of them. Having a simpler interface that just returns arrays of strings will save a lot.

  • Just do not deserialize. Pass the JSON blob (or intermediate array structure) around.

The most relevant point is where an entity is passed to Lua. Note: Lua is not CPU-critical, but memory-critical!
E.g. when editing a label, statements *NEVER* need to be deserialized!
The case where “Germany” needed super-protection (no Lua involved) will probably also become relevant again.
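The Terms special case above, returning plain strings instead of per-term objects, could look like this. A Python sketch; `aliases_as_strings` is a hypothetical helper, and the input mimics the Wikibase alias serialization:

```python
# Hypothetical helper: flatten Wikibase-style alias term objects into plain
# strings, so callers that only need the text never allocate per-term objects.
def aliases_as_strings(aliases):
    """Map {lang: [{"language": ..., "value": ...}, ...]} to {lang: [str, ...]}."""
    return {lang: [term["value"] for term in terms]
            for lang, terms in aliases.items()}

aliases = {"en": [{"language": "en", "value": "FRG"},
                  {"language": "en", "value": "BRD"}]}
print(aliases_as_strings(aliases))  # {'en': ['FRG', 'BRD']}
```

With hundreds of thousands of terms, replacing one object per term with one string per term is where the savings come from.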

Team decision: Profiling. Then avoid unserialization where not needed.
JanZerebecki moved this task from incoming to ready to go on the Wikidata board.Sep 10 2015, 9:28 PM

A few months ago I looked at this briefly, and one finding was that even if we reduce the cost of converting PHP arrays to PHP objects to zero, the cost of json_decode on large entities can still be significant. This means it is crucial to avoid running any such deserialization process over a collection of entities just to, for instance, get a list of labels. Using a regex or some such could help. And of course, having an index for such operations, so the blobs don't need to be retrieved at all, would be good.
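The regex idea mentioned above might look like this. A Python sketch of the concept only; `quick_label` is a hypothetical helper, it assumes Wikibase-style serialization and values without escaped quotes, and a real implementation would need to be much more careful:

```python
import re

def quick_label(blob, lang):
    """Pull a single label out of a raw JSON blob with a regex, skipping a
    full json.loads of the entity. Fragile by design: assumes the
    Wikibase-style {"<lang>": {"language": "<lang>", "value": "..."}}
    shape and no escaped quotes inside the value."""
    pattern = (r'"%s"\s*:\s*\{\s*"language"\s*:\s*"%s"\s*,'
               r'\s*"value"\s*:\s*"([^"\\]*)"') % (lang, lang)
    m = re.search(pattern, blob)
    return m.group(1) if m else None

blob = '{"type":"item","labels":{"en":{"language":"en","value":"Germany"}}}'
print(quick_label(blob, "en"))  # Germany
```

For scanning thousands of blobs just to build a label list, skipping the full decode is the point; correctness caveats are why an index would still be the better long-term answer.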

Tobi_WMDE_SW renamed this task from [Task] Benchmark entity deserialization to [Task] Profile entity deserialization.Sep 15 2015, 1:44 PM
Addshore closed this task as Resolved.Jan 23 2019, 12:34 PM
Addshore claimed this task.
Addshore added a subscriber: Addshore.

Resolved, as we have profiled this every now and again over the years.
We know it is slow.
We also now track save timing for Wikibase specifically (which this is a part of).

Restricted Application added a project: User-Addshore.Jan 23 2019, 12:34 PM