Page MenuHomePhabricator

Prevent browser extensions from inserting junk into the page HTML that gets saved
Open, Needs TriagePublic

Description

@Esanders thinks that we can prevent browser extensions from inserting junk into the page HTML that gets saved. Currently, we try to filter it out after it is inserted, based on a long list of rules for each one we've discovered (see T54327, occasionally updated e.g. T297862).

Ed: Aside from this being a never ending of task of whack-a-mole with an unlimited number of browser extensions and versions, we also risk removing real content. Right now a MediaWiki document is unlikely to contain content that matches these selectors, but as the list grows the chances of this happening increase.

Event Timeline

Change 748842 had a related patch set uploaded (by Bartosz Dziewoński; author: Esanders):

[mediawiki/extensions/Cite@master] Don't attach DOM nodes from the DM store to the main document

https://gerrit.wikimedia.org/r/748842

Change 749173 had a related patch set uploaded (by Bartosz Dziewoński; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Log whenever browser plugin spam is detected

https://gerrit.wikimedia.org/r/749173

Change 748842 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Don't attach DOM nodes from the DM store to the main document

https://gerrit.wikimedia.org/r/748842

Change 764383 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/WikimediaEvents@master] Add 'error.visualeditor' topic

https://gerrit.wikimedia.org/r/764383

Change 749173 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Log whenever browser plugin spam is detected

https://gerrit.wikimedia.org/r/749173

Change 764383 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Add 'error.visualeditor' topic

https://gerrit.wikimedia.org/r/764383

Change 749173 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Log whenever browser plugin spam is detected

https://gerrit.wikimedia.org/r/749173

matmarex removed a project: Patch-For-Review.

Next step is to wait a few days or weeks, and then check the logs.

  • If we find that we're logging no plugin spam, then we should congratulate Ed for fixing the problem with https://gerrit.wikimedia.org/r/748842, pat ourselves on the back, and remove the old code. (But before we do that, check that logging is working.)
  • If we find that we're logging some plugin spam, then we'll have to go back and think about it some more.

Logstash query: https://logstash.wikimedia.org/goto/5fd5d9fcaf327d643a93b3e1ac193ec0
Dashboard with some extra info: https://logstash.wikimedia.org/goto/6c2c46cf3af87dedde43ab6d8717374a

image.png (539×2 px, 54 KB)

The problem is not entirely fixed, but now we have some data about what is getting inserted, so maybe we can make some guesses as to how it happens. Or maybe we can just decide that the current approach with the deny list works well enough.

Initial review of the most common entries:
(I'm not pasting the data from Logstash to ensure we don't accidentally leak some private information)

At a quick glance, it looks like the <script> injection is happening at a network level, before the DOM is even parsed. This means my original hypothesis that in-memory DOM objects aren't getting modified may still be true. We could test this by adding a <script>-stripper to our HTML parser.

The Chimera images must be from a bug in our code. I don't think even copy and paste would allow such HTML to end up in the model.

Change 790723 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Filter <script> tags during parse, instead of save

https://gerrit.wikimedia.org/r/790723

Change 790723 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Filter <script> tags during parse, instead of save

https://gerrit.wikimedia.org/r/790723

Change 791100 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Follow-up I420bfcac8: Fix typo in loop

https://gerrit.wikimedia.org/r/791100

Change 791100 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Follow-up I420bfcac8: Fix typo in loop

https://gerrit.wikimedia.org/r/791100

matmarex removed a project: Patch-For-Review.

Waiting until next deployment to review the logs again.

We're still logging just as many entries as before.

I suggest we drop this for now.

I'm not seeing many entries - in the last month there are 77 hits, all a false positives for either:

  • <style data-mw-deduplicate="... (TempateStyles) because we only filter style:not( [ data-mw ] )

or

  • <img role="none" alt="" class="ve-ce-chimera ve-ce-chimera-gecko" src="data:image/gif...

Not sure how ve-ce-chimera's are leaking through to the DM? CC @dchan

Change #1020829 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Fix selector for TemplateStyles exclusion

https://gerrit.wikimedia.org/r/1020829

Change #1020829 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Fix selector for TemplateStyles exclusion

https://gerrit.wikimedia.org/r/1020829