Page MenuHomePhabricator

Strip src attribute from data-parsoid
Closed, ResolvedPublic0 Estimated Story Points

Description

We had the "src" attribute set in data-parsoid for when we still had bugs in our template encapsulation and serialization, our dsr information was not always entirely trustworthy, and selective serializer was still being tweaked. We used to use data-parsoid.src as a fallback for roundtripping.

However, we no longer have those problems and we should simply strip it out of data-parsoid. On pages like Barack_Obama, this can contribute to a few 100K, I suspect. So, we should audit our serializer code and remove data-parsoid.src from our output.

Event Timeline

ssastry raised the priority of this task from to High.
ssastry updated the task description. (Show Details)
ssastry subscribed.
gerritbot subscribed.

Change 188336 had a related patch set uploaded (by Marcoil):
T88017: Remove data-parsoid.src

https://gerrit.wikimedia.org/r/188336

Patch-For-Review

The patch removes ~123K from Barack_Obama.

Change 188336 merged by jenkins-bot:
T88017: Remove data-parsoid.src for elts with valid data-mw and DSR info

https://gerrit.wikimedia.org/r/188336

Leaving open, further investigation needed to see if there are other cases where data-parsoid.src can also be stripped.

Further investigation results on Barack_Obama: The remaining data-parsoid.src are

  • 1362 <span typeof="mw:Entity">
  • 38 <meta typeof="mw:Extension/ref/Marker">
  • 1 <ol typeof="mw:Extension/references">
  • 1 <meta> without a typeof

As far as I can tell, the first two cases are all in data-mw.html for <ref> expansions. The mw:Extension/ref/Markers are specially worrisome, as their presence makes me think that in some cases expansions inside <ref>s that come themselves from a template expansion aren't cleaned up correctly.

For the purposes of this task, though, completely eliminating these cases would remove ~63K from a total of 3.1M. I'm not sure yet if it can be removed for all cases without round-trip consequences, though.

I think it would be good to get rid of the <meta typeof="mw:Extension/ref/Marker"../> tags from the output. I think a similar fix as for T88019 should do it after making sure they are just something that evaded cleanup.

I think it would be good to get rid of the <meta typeof="mw:Extension/ref/Marker"../> tags from the output. I think a similar fix as for T88019 should do it after making sure they are just something that evaded cleanup.

Yes, at least I'd like to investigate why they aren't being cleaned up after T88019.

Change 192838 had a related patch set uploaded (by Marcoil):
T88017: Remove more cases of data-parsoid.src from mw:Extensions

https://gerrit.wikimedia.org/r/192838

Change 192838 had a related patch set uploaded (by Marcoil):
T88017: Remove more cases of data-parsoid.src from mw:Extensions

https://gerrit.wikimedia.org/r/192838

This new patch removes data-parsoid.src from the <meta typeof="mw:Extension/ref/Marker"> and <ol typeof="mw:Extension/references"> cases mentioned above.

I'm not sure what can be done about page properties and mw:Entities.

Change 192838 merged by jenkins-bot:
T88017: Remove more cases of data-parsoid.src from mw:Extensions

https://gerrit.wikimedia.org/r/192838

After the last patch, there are two cases of data-parsoid.src in Barack_Obama:

  • mw:Entities (), which need src to distinguish between different representations (&nbsp; and &#xA0, for example)
  • Page properties (1, the page's DEFAULTSORT), which need src to distinguish between different valid spellings (like {{DEFAULTSORT}} and {{defaultsort}}.

We could invest time in not keeping src for the default rendering, but I'm not convinced it'd be a big win. Or just leave everything in the hands of selser, which should maintain the correct source.

I don't think these other things will make a big difference in size (since entities and magic words are going to be very few on any particular page, and the .src property will also be small), so not worth trying to do anything more with it.

All patches now in production.

ssastry removed a project: Patch-For-Review.