Page MenuHomePhabricator

Use a bag-on-the-side implementation, rather than an internal .dataobject for node data
Closed, ResolvedPublic

Description

DU.(get|set)NodeData touch node object internals, which won't be available after a php port.

Maybe combine this with the tagId set during treebuilding (https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/HTML5TreeBuilder.js#L61)

Event Timeline

Arlolra triaged this task as High priority.
Arlolra lowered the priority of this task from High to Medium.Sep 17 2018, 9:25 PM

17:22 <+subbu> arlolra, "DU.(get|set)NodeData touch node object internals, which won't be available after a php port." ... I annotated the notes later indicating that I may have been mistaken because Tim seems have to written code assuming it works (in 2014).
17:22 <+subbu> so, either the dom impl has changed, he didn't test it, or i didn't use the right api method.
17:22 <+subbu> that apart, it is still useful to do the bag on the side impl.

17:24 <+cscott> i think PHP supports "property added after the fact", sort of like JS does
17:24 <+cscott> but (a) it's known to be quite slow in PHP, probably not appropriate for something used as often as data-parsoid/etc annotations
17:24 <+cscott> and (b) it *may* not be supported on native objects, which the DOM node classes might be? worth a test, maybe...
17:25 <+cscott> ...but better yet if we don't need to
17:25 <+cscott> http://php.net/manual/en/book.dom.php
17:25 <+cscott> i don't immediately see anything specific to mapping DOM nodes to native PHP objects

We were mistaken about php and dynamic properties (.dataobject) on instances of classes ... so, some of the rationale for this is not valid anymore. But, from a code cleanliness POV (and performance as @cscott refers to in his comment), this may still be a good thing to do.

So it turns out that PHP's DOM classes are a bit weird. As long as a reference to the DOMNode exists in PHP, things work fine. But it seems the DOMNode object gets destroyed when the last PHP reference goes away, and a new one is then created when the same node is later accessed.

$doc = new DOMDocument;
$node = $doc->createElement( 'b' );
$node->appendChild( $doc->createElement( 'a' ) );

$a = $node->firstChild;
$a->dataobject = [];
var_dump( isset( $a->dataobject ) );                // => true
var_dump( isset( $node->firstChild->dataobject ) ); // => true

$a = null;
var_dump( isset( $node->firstChild->dataobject ) ); // => false !!
$node->firstChild->dataobject = [];
var_dump( isset( $node->firstChild->dataobject ) ); // => false !!

$a = $node->firstChild;
var_dump( isset( $node->firstChild->dataobject ) ); // => false
$node->firstChild->dataobject = [];
var_dump( isset( $node->firstChild->dataobject ) ); // => true
var_dump( isset( $a->dataobject ) );                // => true

So, it looks like we need to implement this on the JS side first and not rely on persistence of the PHP wrapper objects around libxml2.

ssastry raised the priority of this task from Medium to High.Feb 4 2019, 8:23 PM

https://stackoverflow.com/questions/8707235/how-to-create-new-property-dynamically has some benchmarks showing how slow "expando" properties are.

If you're referring to https://stackoverflow.com/a/8707348/634419, that's something rather different. We're not looking at implementing __get() or __set() here. Try this:

<?php

define( 'REPS', 100000000 );

class A {
}

$t0 = microtime( true );
$a = new A();
for ( $i = 0; $i < REPS; $i++ ) {
    $a->data = 'test' . $i;
}
$t1 = microtime( true );
echo "Added property: " . ( $t1 - $t0 ) . "\n";

class B {
    public $data;
}

$t0 = microtime( true );
$b = new B();
for ( $i = 0; $i < REPS; $i++ ) {
    $b->data = 'test' . $i;
}
$t1 = microtime( true );
echo "Declared property: " . ( $t1 - $t0 ) . "\n";

On my laptop, it looks like the added property is a whole 3 or 4 nanoseconds slower.

Change 489332 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Move .dataobject off nodes

https://gerrit.wikimedia.org/r/489332

Change 489332 abandoned by Arlolra:
[WIP] Move .dataobject off nodes

https://gerrit.wikimedia.org/r/489332

So it turns out that PHP's DOM classes are a bit weird. As long as a reference to the DOMNode exists in PHP, things work fine. But it seems the DOMNode object gets destroyed when the last PHP reference goes away, and a new one is then created when the same node is later accessed.

[15] boris> $doc = DOMDocument::loadHTML('<b>hi');
 → object(DOMDocument)(

)
[16] boris> $doc->env = [ 'doc' => $doc ];
 → array(
  'doc' => object(DOMDocument)(
    'env' => *** RECURSION ***
  )
)
[17] boris> $doc->lastChild->ownerDocument->env;
 → array(
  'doc' => object(DOMDocument)(
    'env' => *** RECURSION ***
  )
)
[18] boris> $n = $doc->lastChild;
 → object(DOMElement)(

)
[19] boris> $doc = null;
 → NULL
[20] boris> $n->ownerDocument->env;
 → array(
  'doc' => object(DOMDocument)(
    'env' => *** RECURSION ***
  )

Huh, the circular reference really does keep the PHP DOMDocument alive as long as a node is alive?

Change 490654 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Use a bag-on-the-side implementation for node data

https://gerrit.wikimedia.org/r/490654

Change 490654 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use a bag-on-the-side implementation for node data

https://gerrit.wikimedia.org/r/490654