Page MenuHomePhabricator

Update ParsoidExtensionAPI to be a coherent and functional extension API to aid extension implementations
Closed, ResolvedPublic

Description

We should use Parsoid's native implementation of several extensions (Cite, Pre, Nowiki, Gallery, Poem) to identify a narrow but coherent and consistent extension API that is sufficient for the extensions to do their job painlessly. This will require thinking about which of Parsoid's internal concepts are first-class language / MediaWiki concepts and which are Parsoid's implementation-specific concepts and use that to design the appropriate interface that can be published a first draft extension API.

But, perhaps the first step is to audit what of Parsoid's implementation details are currently exposed to the natively implemented extensions.

Details

ProjectBranchLines +/-Subject
mediawiki/vendormaster+2 K -4 K
mediawiki/services/parsoidmaster+116 -92
mediawiki/services/parsoidmaster+90 -76
mediawiki/vendormaster+648 -597
mediawiki/services/parsoidmaster+93 -48
mediawiki/vendormaster+240 -187
mediawiki/services/parsoidmaster+20 -33
mediawiki/services/parsoidmaster+11 -10
mediawiki/services/parsoidmaster+8 -8
mediawiki/services/parsoidmaster+25 -28
mediawiki/vendormaster+775 -542
mediawiki/services/parsoidmaster+304 -28
mediawiki/services/parsoidmaster+15 -31
mediawiki/services/parsoidmaster+41 -58
mediawiki/services/parsoidmaster+65 -37
mediawiki/services/parsoidmaster+24 -42
mediawiki/services/parsoidmaster+79 -35
mediawiki/services/parsoidmaster+37 -20
mediawiki/services/parsoidmaster+20 -13
mediawiki/services/parsoidmaster+43 -26
mediawiki/services/parsoidmaster+75 -62
mediawiki/services/parsoidmaster+13 -70
mediawiki/services/parsoidmaster+362 -164
mediawiki/services/parsoidmaster+17 -20
Show related patches Customize query in gerrit

Related Objects

Event Timeline

ssastry created this task.Jan 14 2020, 12:56 PM
Restricted Application added subscribers: jeblad, Aklapper. · View Herald TranscriptJan 14 2020, 12:56 PM
[subbu@earth:~/work/wmf/parsoid/src/Ext] git grep -h '^use' | grep Parsoid | sort | uniq -c
      9 use Parsoid\Config\Env;
      7 use Parsoid\Config\ParsoidExtensionAPI;
      7 use Parsoid\Ext\Extension;
      8 use Parsoid\Ext\ExtensionTag;
      6 use Parsoid\Html2Wt\SerializerState;
      3 use Parsoid\Tokens\DomSourceRange;
      2 use Parsoid\Tokens\KV;
      2 use Parsoid\Tokens\SourceRange;
      5 use Parsoid\Utils\ContentUtils;
      8 use Parsoid\Utils\DOMCompat;
      8 use Parsoid\Utils\DOMDataUtils;
      8 use Parsoid\Utils\DOMUtils;
      4 use Parsoid\Utils\PHPUtils;
      2 use Parsoid\Utils\Title;
      2 use Parsoid\Utils\TokenUtils;
      4 use Parsoid\Utils\Util;
      3 use Parsoid\Utils\WTUtils;
      2 use Parsoid\Wt2Html\TT\Sanitizer;

Change 565563 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Start untangling Parsoid internals from extensions

https://gerrit.wikimedia.org/r/565563

Change 566417 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] No need to explicitly pass 'inTemplate' flag from extension code

https://gerrit.wikimedia.org/r/566417

Change 566417 merged by jenkins-bot:
[mediawiki/services/parsoid@master] No need to explicitly pass 'inTemplate' flag from extension code

https://gerrit.wikimedia.org/r/566417

The above gerrit patches have moved the needle on this by doing a cleanup of extension code in the Parsoid repo by reducing exposure of Parsoid internals. But more work is needed. See commit message of gerrit 565563

LGoto assigned this task to ssastry.Jan 28 2020, 10:00 PM
LGoto removed a subscriber: ssastry.

Change 569702 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Use extension config option for html2wt formatting of extension tags

https://gerrit.wikimedia.org/r/569702

Change 565563 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Start untangling Parsoid internals from extensions

https://gerrit.wikimedia.org/r/565563

Change 570919 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Cite: Remove more Parsoid internals knowledge

https://gerrit.wikimedia.org/r/570919

One thing I am finding is that Cite (and possibly our other extensions in the Parsoid codebase) have knowledge of DOM state and how Parsoid optimizes handling of data-* attributes by keeping them off to the side in a a bag. So, we can resolve this situation in one of two ways:

  1. Extensions don't have knowledge of the DOM state and needn't have to putz around with visitAndLoad, visitAndStore gymnastics for perf optimization, the API has to handle that transparently as far as possible. In this mode, we document the fact that extensions shouldn't make any assumptions about how / where data-attributes of DOM nodes are stored since the implementation details might change. We could require extensions to go through provided API / helpers to read/write data-* attributes.
  1. We need to document DOM states (raw DOM, post-processed DOM) and make those first-class Parsoid DOM concepts and require extensions to be aware what state the DOM is in and use appropriate helper methods.

Not yet sure what is desirable.

ssastry added a comment.EditedFeb 7 2020, 4:56 PM

Here is the list of helper utilities used in all of src/Ext/*. Some of them are just convenience helpers. But, the visitAndLoad/StoreDataAttribs as well as ContentUtils helpers expose Parsoid implementation knowledge and need to be handled as per the previous comment.

ContentUtils::ppToDOM
ContentUtils::ppToXML
ContentUtils::shiftDSR
ContentUtils::toXML
DOMUtils::assertElt
DOMUtils::hasNChildren
DOMUtils::isBody
DOMUtils::isDiffMarker
DOMUtils::isElt
DOMUtils::isText
DOMUtils::matchTypeOf
DOMUtils::migrateChildrenBetweenDocs
DOMUtils::migrateChildren
DOMUtils::selectMediaElt
DOMDataUtils::addAttributes
DOMDataUtils::addTypeOf
DOMDataUtils::getDataMw
DOMDataUtils::getDataParsoid
DOMDataUtils::setDataParsoid
DOMDataUtils::setDataMw
DOMDataUtils::visitAndStoreDataAttribs
DOMDataUtils::visitAndLoadDataAttribs
TokenUtils::kvToHash
Util::clone
Util::decodeWtEntities
Util::entityEncodeAll
Util::parseMediaDimensions
Util::validateMediaParam
WTUtils::escapeNowikiTags
WTUtils::fromExtensionContent
WTUtils::isSealedFragmentOfType
WTUtils::isNewElt

Change 569702 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use extension config option for html2wt formatting of extension tags

https://gerrit.wikimedia.org/r/569702

ssastry triaged this task as High priority.Feb 10 2020, 12:55 AM

Change 570919 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Cite: Remove more Parsoid internals knowledge

https://gerrit.wikimedia.org/r/570919

Change 573766 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Remove direct access to Sanitizer from extension code

https://gerrit.wikimedia.org/r/573766

Change 573766 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove direct access to Sanitizer from extension code

https://gerrit.wikimedia.org/r/573766

Change 575093 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Pass $extApi, not $env to extension callbacks

https://gerrit.wikimedia.org/r/575093

ssastry added a comment.EditedFeb 26 2020, 11:51 PM

When all the patches in gerrit are reviewed and merged, we will be left with the following imports in the src/Ext directory:

use Wikimedia\Assert\Assert;
use Wikimedia\Parsoid\Config\ParsoidExtensionAPI;
use Wikimedia\Parsoid\Ext\Extension;
use Wikimedia\Parsoid\Ext\ExtensionTag;
use Wikimedia\Parsoid\Tokens\DomSourceRange;
use Wikimedia\Parsoid\Tokens\KV;
use Wikimedia\Parsoid\Tokens\SourceRange;
use Wikimedia\Parsoid\Utils\DOMCompat;
use Wikimedia\Parsoid\Utils\DOMDataUtils;
use Wikimedia\Parsoid\Utils\DOMUtils;
use Wikimedia\Parsoid\Utils\PHPUtils;
use Wikimedia\Parsoid\Utils\TokenUtils;
use Wikimedia\Parsoid\Utils\Util;
use Wikimedia\Parsoid\Utils\WTUtils;
  • In a parsing team meeting, we agreed that for now, we'll expose DSR offsets as a first-class concept to extensions. So, DomSourceRange, KV (which capture offsets for extension arguments), and SourceRange objects look okay right now.
  • DOMCompat is also okay to expose since that is just a compatibility layer on top of libxml
  • ParsoidExtensionAPI, Extension, ExensionTag are okay of course since they are the core of the Parsoid's exposed API / extension interface.

So, next steps are:

  • What is the best way to export the Tokens\* classes so we don't expose more than we want to?
  • What is the best way to export the util helpers across multiple class files in a way that is maintainable for us ( we tend to refactor and tweak helpers, and by exposing them, we increase the risk of breakage ), and in a way that is easier to use for extension authors ( without having to hunt around various helper files ), and that also feels coherent vs. a rag-tag collection of method that happen to be used right now by these extensions.

Change 575093 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Pass $extApi, not $env to extension callbacks

https://gerrit.wikimedia.org/r/575093

Change 571402 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Cite: Eliminate knowledge of DOM state from a few more call sites

https://gerrit.wikimedia.org/r/571402

Change 574580 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Extensions: Use extApi->parseHTML to create seed documents

https://gerrit.wikimedia.org/r/574580

Change 574604 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Provide extensions SiteConfig & PageConfig access via ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/574604

Change 574605 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add additional API methods

https://gerrit.wikimedia.org/r/574605

Change 574606 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add 'shiftDSRFn' option to parse* API

https://gerrit.wikimedia.org/r/574606

Change 576486 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Extensions: Remove direct use of SourceRange

https://gerrit.wikimedia.org/r/576486

Change 576998 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Provide a Utils class for extensions

https://gerrit.wikimedia.org/r/576998

This pretty much brings this first phase of cleanup to an end. There is the stray KV use which I think I can get rid of ... but needs a bit of thought. We should also consider looking at how coherent the ParsoidExtensionAPI interface (as defined by the exposed set of methods are) and do any necessary adjustments to naming and adding / removing methods to bring whatever coherence is necessary. Once the last patch of the chain is merged and we do any other additional cleanup, I think we can publish the first version of a draft extension API.

Change 574604 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Provide extensions SiteConfig & PageConfig access via ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/574604

Change 571402 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Cite: Eliminate knowledge of DOM state from a few more call sites

https://gerrit.wikimedia.org/r/571402

Change 574580 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Extensions: Use extApi->parseHTML to create seed documents

https://gerrit.wikimedia.org/r/574580

Change 574605 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add additional API methods

https://gerrit.wikimedia.org/r/574605

Change 574606 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add 'shiftDSRFn' option to parse* API

https://gerrit.wikimedia.org/r/574606

Change 576486 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Extensions: Remove direct use of SourceRange and Frame

https://gerrit.wikimedia.org/r/576486

Change 576998 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Provide utility classes for extensions

https://gerrit.wikimedia.org/r/576998

Change 583640 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Minor: Rename pipelineOpts to parseOpts in ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/583640

Change 583641 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Extension Config Options: Reorg options for ease of use

https://gerrit.wikimedia.org/r/583641

Change 583689 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Give extensions an option to null out DSR values

https://gerrit.wikimedia.org/r/583689

Change 583747 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Cleanup the toHTML / innerHTML mess

https://gerrit.wikimedia.org/r/583747

Change 583485 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Remove exposure of KV objects by expanding ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/583485

With that last patch in the long chain, here is what the Parsoid class imports looks like:

[subbu@earth:~/work/wmf/parsoid/src/Ext] git grep -h '^use ' */* | grep Parsoid | sort | uniq -c
      2 use Wikimedia\Parsoid\Core\DomSourceRange;
      1 use Wikimedia\Parsoid\Core\SelserData;
      1 use Wikimedia\Parsoid\Ext\ContentModelHandlerExtension;
      7 use Wikimedia\Parsoid\Ext\DOMDataUtils;
      7 use Wikimedia\Parsoid\Ext\Extension;
      8 use Wikimedia\Parsoid\Ext\ExtensionTag;
     16 use Wikimedia\Parsoid\Ext\ParsoidExtensionAPI;
      4 use Wikimedia\Parsoid\Ext\PHPUtils;
      3 use Wikimedia\Parsoid\Ext\Util;
      2 use Wikimedia\Parsoid\Ext\WTUtils;
      9 use Wikimedia\Parsoid\Utils\DOMCompat;
      7 use Wikimedia\Parsoid\Utils\DOMUtils;

So, there is no more cheating -- although the API itself has a couple methods that introduces cheating through the backdoor. We can purge them with a bit more work. But for now, it looks like we can call this first phase - of crafting the Parsoid Extension API and using them inside Parsoid's implementations of a bunch of extensions - done! (i.e. once the patches are reviewed and merged).

Change 584052 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Extension API: Adopt somethingToSomethingElse naming wherever possible

https://gerrit.wikimedia.org/r/584052

Change 584120 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a8

https://gerrit.wikimedia.org/r/584120

Change 584120 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a8

https://gerrit.wikimedia.org/r/584120

Change 584720 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add domToWikitext method + fix Cite to use it

https://gerrit.wikimedia.org/r/584720

Change 583640 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Minor: Rename pipelineOpts to parseOpts in ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/583640

Change 583641 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Extension Config Options: Reorg options for ease of use

https://gerrit.wikimedia.org/r/583641

Change 583689 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Give extensions an option to null out DSR values

https://gerrit.wikimedia.org/r/583689

Change 583747 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Cleanup the toHTML / innerHTML mess

https://gerrit.wikimedia.org/r/583747

Change 585888 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a9

https://gerrit.wikimedia.org/r/585888

Change 585888 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a9

https://gerrit.wikimedia.org/r/585888

Change 583485 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove exposure of KV objects by expanding ParsoidExtensionAPI

https://gerrit.wikimedia.org/r/583485

Change 588421 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a10

https://gerrit.wikimedia.org/r/588421

Change 588421 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a10

https://gerrit.wikimedia.org/r/588421

Change 584052 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Extension API: Adopt somethingToSomethingElse naming wherever possible

https://gerrit.wikimedia.org/r/584052

Change 584720 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ParsoidExtensionAPI: Add domToWikitext method + fix Cite to use it

https://gerrit.wikimedia.org/r/584720

ssastry closed this task as Resolved.Apr 17 2020, 9:09 PM

There is going to be additional tweaking and updates to the API, but I am going to call this first pass done.

Change 592663 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a11

https://gerrit.wikimedia.org/r/592663

Change 592663 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a11

https://gerrit.wikimedia.org/r/592663

Change 607105 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump Parsoid to 0.12.0-a18

https://gerrit.wikimedia.org/r/607105