Re-consider using OO.getHash in convertModelFromDom
Open, NormalPublic

Description

convertModelFromDom stage spends a lot of time in OO#getHash. If instead of hashing I assign each object an autoincrement integer key, the median CPU time for Barack Obama drops by 50-100ms, and the min time drops by 100-200ms.

Is there a hard requirement that objects with identical key/value pairs hash to the same key, or is that just a strategy to save memory? If the latter, then it is probably worth abandoning it in favor of using array indices.

ori created this task.Feb 23 2015, 7:21 AM
ori updated the task description. (Show Details)
ori raised the priority of this task from to Needs Triage.
ori added a project: VisualEditor-Performance.
ori added subscribers: ori, Catrope.
Restricted Application added a project: VisualEditor. · View Herald TranscriptFeb 23 2015, 7:21 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Jdforrester-WMF renamed this task from Reconsider OO.getHash to Re-consider using OO.getHash in convertModelFromDom.Feb 23 2015, 6:15 PM
Jdforrester-WMF triaged this task as Normal priority.
Jdforrester-WMF set Security to None.
Jdforrester-WMF moved this task from To Triage to TR7: ???? on the VisualEditor board.
Krinkle added a subscriber: Krinkle.Mar 1 2015, 1:50 AM

@Catrope and I talked about this back in March 2013.

Two weeks ago I revived the subject while going through old Gist files.

https://gist.github.com/Krinkle/04600fa0532701440446/revisions

https://gist.github.com/Krinkle/04600fa0532701440446.

Many data characters in VisualEditor have the same annotations (e.g. bold, italic etc.) to optimise we re-use object references.

Native references (old)

Properties:

  • Cheap in memory
  • Native garbage collection
  • Easy and direct comparison (by object reference)
  • Serialised form is very big
  • When serialising, it is automatically usable across instances as it will expand the object references and stringify each in-place
    • when copy-pasting across windows
    • when sending/receiving transactions from other users over the wire (for the future)
  • Unserialising will result in separate objects for each annotation, we lose the optimisation.

Local index (new)

New proposal:

  • Abstract the references concept into a manual index of annotations and refer to them by numerical id.
  • Cheap in memory (though slightly more expensive than the old way. more: index array and maintenance functions)
  • Manual garbage collection
  • Easy comparison (by numerical id)
  • Serialisation is much smaller
  • When serialising we have to include the index as the ids may be different on different VE instances
  • When deserialising we have to merge the index, figure out duplicates and update references to ids they'll have in this document.

The main disadvantage of the old method (and the main motivation to adopt to this proposal) is this last item.

Though we this optimisation could actually be done to the old method as well. We could keep an index of annotations keyed by hash and use that object instead of the one given. To avoid deserialising an object only to be garbage collected, we could have the serialising output hashes instead of the objects themselves.

The advantage of this alternative proposal is that we don't have to include the index. The original "Local index" proposal above however has the extra advantage of being much faster (numerial indexes instead of string hashes, much cheaper to compute / lookup, no need to call ve.getHash).

So, it was interesting to check out, but the idea as-is looks good to me. Nothing I'd different.