Page MenuHomePhabricator

scrubWikitext doesn't reach Parsoid
Closed, ResolvedPublic1 Estimated Story Points

Description

In https://nl.wikipedia.org/w/index.php?diff=44562366, an indent-pre nowiki is added, despite scrubWikitext being enabled in VE. This is because that normalization only happens on new paragraphs and this one is considered old / edited.

<p id="mwAw"> Oncko Quirijn van Kammen (geboren op 14 maart 1979) is een Nederlandse scenarioschrijver van onder meer kinderfilms. Van Kammen is als schrijver van scripts actief voor film en televisie. </p>

Filing this to see what @ssastry thinks we should do about it.

Event Timeline

Arlolra raised the priority of this task from to Medium.
Arlolra updated the task description. (Show Details)
Arlolra added subscribers: Arlolra, Elitre, ssastry.

Actually, wait a minute, this just isn't true. The way scrubWikitext is currently implemented for indent-pres applies to any paragraphs where we aren't reusing source.

Maybe that's reason the current implementation is better than https://gerrit.wikimedia.org/r/#/c/226667/

But more to the point, maybe scrubWikitext isn't being sent by VE? (or recognized by Parsoid?) Or something else that I'm tired and overlooking.

But more to the point, maybe scrubWikitext isn't being sent by VE? (or recognized by Parsoid?) Or something else that I'm tired and overlooking.

Is RB passing along the API flag? I had asked this on the VE patch and assumed it was given the fact that the VE patch was merged and deployed.

I logged the request body and query in labs,

{ original: 
   { html: 
      { headers: [Object],
        body: '<!DOCTYPE html>\n<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/revision/239464"><head prefix="mwr: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="c1cefdf0-1a88-11e5-978a-9977740565c6"/><meta property="mw:articleNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/0"/><meta property="dc:modified" content="2015-06-24T15:50:35.000Z"/><meta about="mwr:user/6273" property="dc:title" content="Lego-test1"/><link rel="dc:contributor" resource="mwr:user/6273"/><meta property="mw:revisionSHA1" content="b2f52fb7928ec2b251c99912bdd95ddd8b1dfd5b"/><meta property="dc:description" content="Pywikibot v.2"/><meta property="mw:parsoidVersion" content="0"/><link rel="dc:isVersionOf" href="//en.wikipedia.beta.wmflabs.org/wiki/UserMergeylylds"/><title>UserMergeylylds</title><base href="//en.wikipedia.beta.wmflabs.org/wiki/"/><link rel="stylesheet" href="//en.wikipedia.beta.wmflabs.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid&amp;only=styles&amp;skin=vector"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki" dir="ltr"><p id="mwAQ">fgmeqswsli</p></body></html>' },
     'data-parsoid': { headers: [Object], body: [Object] },
     revid: '239464' },
  html: '<!doctype html><html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/revision/239464"><head prefix="mwr: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="c1cefdf0-1a88-11e5-978a-9977740565c6"><meta property="mw:articleNamespace" content="0"><link rel="dc:replaces" resource="mwr:revision/0"><meta property="dc:modified" content="2015-06-24T15:50:35.000Z"><meta about="mwr:user/6273" property="dc:title" content="Lego-test1"><link rel="dc:contributor" resource="mwr:user/6273"><meta property="mw:revisionSHA1" content="b2f52fb7928ec2b251c99912bdd95ddd8b1dfd5b"><meta property="dc:description" content="Pywikibot v.2"><meta property="mw:parsoidVersion" content="0"><link rel="dc:isVersionOf" href="//en.wikipedia.beta.wmflabs.org/wiki/UserMergeylylds"><title>UserMergeylylds</title><base href="http://en.wikipedia.beta.wmflabs.org/wiki/"><link rel="stylesheet" href="//en.wikipedia.beta.wmflabs.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid&amp;only=styles&amp;skin=vector"></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki" dir="ltr"><p id="mwAQ">fgmeqswsli  </p></body></html>' }
{}

scrubWikitext isn't making it through. Locally, I tested without RESTBase and Parsoid does receive it.

Arlolra renamed this task from Indent-pre scrubbing on edited paragraphs to scrubWikitext doesn't reach Parsoid.Jul 27 2015, 10:53 PM
Arlolra updated the task description. (Show Details)

@Arlolra, when was the Parsoid API changed to accept scrubWikitext? I remember a discussion about maybe adding it a while ago, but wasn't aware that it is now deployed.

In any case, I guess we'll need to expand the API to provide a parameter for HTML cleanup. Maybe something like 'cleanup', 'scrub' or 'sanitize', so that we can use the same parameter for the html2html end point as well?

@GWicke Mid-April, but it wasn't enabled in VE until recently. See T105239

I'm pretty sure @cscott is going to want us to not add more differing param names. But, I guess we can clean this up in T100680

Yes, @cscott is going to be very sad if we continue down this path of gratuitous incompatibility. We already have body vs bodyOnly that I have to clean up.

We discussed this on IRC, and established that scrubWikitext is currently only applying some limited clean-ups that should not have large effects on WYSIWYG rendering: https://www.mediawiki.org/wiki/Parsoid/Normalizations

The plan is to add additional types of cleanups in the future, and let clients select which cleanups / normalizations to apply. Possible examples include MS Office, Google Docs and perhaps a generic catch-all pass.

This feature is important for the VE roll-out, so we agreed to quickly enable forwarding of the scrubWikitext parameter without documenting the feature for now. This will then give us time to discuss the best longer-term API without time pressure.

Esanders claimed this task.