
[Task] Profile entity deserialization
Closed, Resolved · Public

Description

It seems like the full deserialization of Entities uses a lot of CPU and RAM. We should be able to pinpoint exactly where these resources are being spent, so that we can implement the following improvements:

  • Have specialized deserializers generating specialized model objects.
  • Deferred deserialization. At some level, stop deserializing, remember the fragment of the array, and continue when the deserialized value is asked for (lazy loading; see the sketch after this list).
  • Just do not deserialize. Pass the JSON blob (or intermediate array structure) around.
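
As a rough illustration of the deferred option, here is a minimal sketch with hypothetical class and method names (not the actual Wikibase DataModel API): keep the decoded array fragment around and only build model objects when something actually asks for them.

```php
<?php
// Hypothetical sketch of deferred (lazy) deserialization: hold on to the raw
// array fragment and only run the expensive deserialization on first access.

class LazyStatementList
{
	/** @var array|null Raw statements fragment from the decoded JSON. */
	private $raw;

	/** @var object[]|null Fully built statement objects, once requested. */
	private $statements = null;

	/** @var callable Turns one raw statement array into a model object. */
	private $deserializeStatement;

	public function __construct( array $raw, callable $deserializeStatement ) {
		$this->raw = $raw;
		$this->deserializeStatement = $deserializeStatement;
	}

	/** The expensive work happens here, on first access, not at load time. */
	public function toArray(): array {
		if ( $this->statements === null ) {
			$this->statements = array_map( $this->deserializeStatement, $this->raw );
			$this->raw = null; // let the raw fragment be garbage collected
		}
		return $this->statements;
	}

	/** Cheap questions can be answered from the raw fragment directly. */
	public function count(): int {
		return $this->statements === null ? count( $this->raw ) : count( $this->statements );
	}
}
```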

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel subscribed.

@daniel: Associating a project with this task would be highly welcome, so that someone might find it when searching open tasks by project. :)

Lydia_Pintscher set Security to None.
Lydia_Pintscher renamed this task from Benchmark deferred entity deserialization to [Task] Benchmark entity deserialization. Sep 8 2015, 2:22 PM
Lydia_Pintscher updated the task description.
Lydia_Pintscher subscribed.

Notes from story time:

Collecting options and wild ideas:

  • Profile, find hotspots, try to improve the existing architecture in small but relevant steps.
  • Have specialized deserializers generating specialized model objects.
  • Deferred deserialization. At some level, stop deserializing, remember the fragment of the array (typically a whole section, e.g. all statements or all aliases) and the relevant deserializers, and continue when the deserialized value is asked for.

Lua needs to be adapted to make use of this. Currently, a lot of code assumes it has a fully deserialized entity, *OR* the code just works on a JSON blob.
Special case: Terms (labels, descriptions, aliases) are objects, and there are hundreds of thousands of them. Having a simpler interface that just returns arrays of strings would save a lot (see the sketch below).

  • Just do not deserialize. Pass the JSON blob (or intermediate array structure) around.

The most relevant point is where an entity is passed to Lua. Note: Lua is not CPU critical, but memory critical!
E.g. when editing a label, statements *NEVER* need to be deserialized!
The case where “Germany” needed super-protection (no Lua involved) will probably also become relevant again.
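
A rough sketch of the "terms as plain strings" idea mentioned above (the helper name is hypothetical; the field layout follows the public Wikibase entity JSON, where labels are keyed by language code):

```php
<?php
// Hypothetical helper: read labels straight out of the decoded entity array
// instead of building one term object per label.

/**
 * @param array $entityData Decoded entity JSON, i.e. json_decode( $blob, true ).
 * @return string[] Map of language code => label text.
 */
function getLabelsAsStrings( array $entityData ): array {
	$labels = [];
	foreach ( $entityData['labels'] ?? [] as $languageCode => $term ) {
		$labels[$languageCode] = $term['value'];
	}
	return $labels;
}

// Usage: no term objects are created, which matters when Lua modules touch
// the labels of a large number of entities.
$entityData = json_decode( $json, true ); // $json: a cached entity blob
$labels = getLabelsAsStrings( $entityData );
```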

Team decision: profile first, then avoid deserialization where it is not needed.
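
A minimal timing sketch for that (not the actual profiling setup, which would more likely use XHProf or Xdebug): measure json_decode separately from the construction of model objects out of the decoded arrays.

```php
<?php
// Minimal timing sketch: separate the cost of json_decode from the cost of
// turning the resulting arrays into model objects.

$json = file_get_contents( 'entity.json' ); // some large cached entity blob

$start = hrtime( true );
$data = json_decode( $json, true );
$decodeNs = hrtime( true ) - $start;

$start = hrtime( true );
// $deserializer stands in for the full entity deserializer under test.
$entity = $deserializer->deserialize( $data );
$buildNs = hrtime( true ) - $start;

printf(
	"json_decode: %.2f ms, object construction: %.2f ms, peak memory: %.1f MB\n",
	$decodeNs / 1e6,
	$buildNs / 1e6,
	memory_get_peak_usage( true ) / 1048576
);
```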

A few months ago I looked at this briefly, and one finding was that even if we get the cost of converting PHP arrays to PHP objects down to zero, the cost of json_decode on large entities can still be significant. This means it is crucial to avoid running any such deserialization over a collection of entities just to, for instance, get a list of labels. Using a regex or some such could help (see the sketch below). And of course having an index for such operations, so the blobs don't need to be retrieved at all, would be good.
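
A rough sketch of the regex idea (illustration only; real JSON can contain escaped quotes and differently ordered keys, and descriptions share the same shape, so production code would need to anchor on the "labels" key or use an index):

```php
<?php
// Sketch only: pull a single label out of the raw JSON blob without running
// json_decode over the whole entity. Assumes the compact key order produced
// by the Wikibase serializer.

function extractLabelFromBlob( string $json, string $languageCode ): ?string {
	$lang = preg_quote( $languageCode, '/' );
	$pattern = '/"' . $lang . '":\{"language":"' . $lang
		. '","value":"((?:[^"\\\\]|\\\\.)*)"/';
	if ( preg_match( $pattern, $json, $m ) ) {
		// Re-decode just the matched string so JSON escapes are handled.
		return json_decode( '"' . $m[1] . '"' );
	}
	return null;
}
```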

Tobi_WMDE_SW renamed this task from [Task] Benchmark entity deserialization to [Task] Profile entity deserialization. Sep 15 2015, 1:44 PM
Addshore claimed this task.
Addshore subscribed.

Resolved, as we do profile this every now and again over the years.
We know it is slow.
We also now track save timing for Wikibase specifically (which this is a part of).