Enable $wgHtml5 on Wikimedia wikis
Closed, Resolved (Public)

Description

Author: ayg

Description:
$wgHtml5 is usable in 1.17 -- get rid of $wgHtml5 = false in the config files. It was only there because the old 1.16 snapshot used by Wikimedia didn't produce well-formed XML. Even the 1.16 release works fine.

The only possible negative side-effect would be that some pages might not be well-formed XML due to bugs, e.g., if there are named entities that have crept in since r68803. These should be easy to fix on a case-by-case basis. $wgHtml5 = true has been the default since 1.16, so we want to test it on the main site and fix any resulting bugs, if there are any.


Version: unspecified
Severity: enhancement

Details

Reference
bz27478

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:22 PM)
bzimport set Reference to bz27478.
bzimport added a subscriber: Unknown Object (MLST).

How could we find fixable items? Would loading a dump and checking for pages that weren't well-formed work?

I'm interested b/c this tickles my inner XML nerd.

ayg wrote:

The problem wouldn't be in pages, it would be in code. The code should always still output well-formed XML, but there might be bugs. Well-formedness errors from anything run through the parser should be a nonissue, but special pages or such that output stuff directly might sneak in some entities, in which case that page would break.

(Of course, you've always been able to get the parser to output non-well-formed output, even in non-HTML5 mode. Some of the expected fails in parser tests exhibit this.)
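One rough way to look for this kind of breakage, sketched in Python: save the output of a page and feed it to a strict XML parser. The `is_well_formed` helper and the sample strings are illustrative assumptions, not anything that ships with MediaWiki. With no DTD loaded, a named entity such as `&nbsp;` is undefined and the parser rejects it, while the numeric form `&#160;` is accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical special-page output: no DTD is loaded, so &nbsp; is undefined.
with_named_entity = "<p>Foo&nbsp;bar</p>"
# The same text using a numeric character reference, which needs no DTD.
with_numeric_ref = "<p>Foo&#160;bar</p>"

print(is_well_formed(with_named_entity))
print(is_well_formed(with_numeric_ref))
```

A check like this only says whether a given response parses; it would still need to be pointed at special pages and interface messages, since that is where the entities are expected to hide.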

I'm sure lots of entities have crept in, since nobody has been enforcing that (I've made few comments in code review).

(In reply to comment #3)

I'm sure lots of entities have crept in, since nobody has been enforcing that
(I've made few comments in code review).

Could you point to an example of this in code review so that I know what to look for?

ayg wrote:

(In reply to comment #3)

I'm sure lots of entities have crept in, since nobody has been enforcing that
(I've made few comments in code review).

Should be fixed by and large in r82413. Some corner cases will surely come up anyway in deployment, because of things like messages being output as raw HTML and admins putting entities in them, but it shouldn't be too hard to fix. Worst that happens is some screen-scrapers temporarily break. I originally discussed this with Brion and Tim and they agreed the extra long-term pain for screen-scrapers was an okay risk (maybe even a good thing!).

<logmsgbot> !log demon synchronized php-1.17/wmf-config/InitialiseSettings.php 'Turning HTML5 back off for now. Reports of breakage on zhwiki in Internet Explorer on XP. Also people are complaining about userscripts breaking, but its probably screen scraping (which people shouldn't be doing anyway and we've been saying for years)'

Turns out zhwiki issue is unrelated, oh well.

ayg wrote:

Since the zhwiki issue is unrelated, can this be turned back on? If we care about regex screen-scrapers for some reason, then relay the exact errors people are getting so I can take a look and see what I can do.

alexsm333 wrote:

You explained exactly the same error with scraping in 2009:
[[Wikipedia:Village pump (technical)/Archive 67#Twinkle stalling]]

Also bug 27672 was filed yesterday.

Several problems on enwiki were caused by the difference in Sanitize::escapeId between HTML4 and HTML5 modes.

<ref name="foo"> tries to generate a link in the page something like [1]. In HTML4 mode <ref name="foo [[bar]]"> generates [1] which functions correctly, but in HTML5 mode it generates #cite_note-foo_[[bar-0|[1]]] which of course breaks horribly.

Also, in HTML4 mode <ref name="foo">, <ref name="_foo_">, and <ref name="''foo''"> are all distinct. In HTML5 mode these are all considered equivalent.
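To illustrate how the three names can collapse into one, here is a small Python sketch. Both functions are assumptions written for this comment: they only approximate the two styles of id escaping (encode every character versus collapse punctuation and trim) and are not a copy of Sanitizer::escapeId.

```python
import re
from urllib.parse import quote


def html4_style_id(name: str) -> str:
    """Illustrative HTML4-style escaping: keep every character, but encode it."""
    encoded = quote(name.replace(" ", "_"), safe="")
    return encoded.replace("%", ".")


def html5_style_id(name: str) -> str:
    """Illustrative aggressive normalization: collapse runs of quotes,
    underscores and whitespace into one underscore, then trim the ends."""
    collapsed = re.sub(r"[ \t\n\r\f_'\"&#%]+", "_", name)
    return collapsed.strip("_")


names = ["foo", "_foo_", "''foo''"]
print([html4_style_id(n) for n in names])  # three different ids
print([html5_style_id(n) for n in names])  # the same id three times
```

Under these assumed rules the encoding style keeps the names apart because it never discards a character, while the collapsing style throws away exactly the characters that distinguished them.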

ayg wrote:

(In reply to comment #11)

You explained exactly the same error with scraping in 2009:
[[Wikipedia:Village pump (technical)/Archive 67#Twinkle stalling]]

Also bug 27672 was filed yesterday.

This suggests maybe some named entities have crept through, or some other type of well-formedness error. It would be nice if people said which exact pages failed, but it would probably be possible to figure it out. I'm guessing it's the result of messages being passed as raw HTML and sysops adding named entities to them, but it could be something else too.

The easy way out would be to restore the old hack where we serve HTML5 with an HTML 4.01 Strict doctype, which is valid HTML5 but rather confusing. This is how 1.16 works by default. That way a DTD is specified, which means that non-browser UAs will parse named entities successfully. We can consider switching back to the HTML5 doctype later.

(In reply to comment #12)

Several problems on enwiki were caused by the difference in Sanitize::escapeId
between HTML4 and HTML5 modes.

Hmm. This should be disable-able by setting $wgExperimentalHtmlIds to false, leaving $wgHtml5 true (which might leave well-formedness issues). A proper fix will require some more thought, though. The changes to escapeId() are really meant for headings, but we can't realistically distinguish wikilinks meant to point at headings from wikilinks meant to point at other things.

In practice, it looks like Cite is the major problem here (with the id's), and it can probably be fixed. My first inclination is to just generate arbitrary id's for named refs instead of trying to key off the names.
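A minimal sketch of the "arbitrary id's" idea, in Python for brevity. The `RefIdAllocator` class and the `cite_note-N` format are hypothetical and only show the shape of the approach; this is not how Cite is implemented.

```python
from itertools import count


class RefIdAllocator:
    """Hand out opaque ids for named refs, independent of how the name is spelled."""

    def __init__(self):
        self._next = count(1)
        self._ids = {}

    def id_for(self, name: str) -> str:
        # Reuse the id when the exact same name is seen again;
        # otherwise mint a new one from the counter.
        if name not in self._ids:
            self._ids[name] = f"cite_note-{next(self._next)}"
        return self._ids[name]


allocator = RefIdAllocator()
for name in ["foo", "_foo_", "foo [[bar]]", "foo"]:
    print(name, "->", allocator.id_for(name))
```

Because the id never contains the name, nothing in the name needs escaping, so names with wikitext in them and names that differ only in punctuation both stay safe and distinct. The trade-off is that anchors are no longer predictable from the ref name.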

ayg wrote:

For reference, from #wikimedia-tech (contains some additional links that should be checked when fixing):

[110224 11:45:22] <RoanKattouw> AryehGregor: https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Vpt#More_than_Twinkle_is_broken , https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Vpt#Merged_Reflinks , https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Vpt#Recent_Javascript_changes
[110224 11:46:53] <RoanKattouw_away> AryehGregor: Due to those issues, HTML 5 was switched back off today

cbm.wikipedia wrote:

This is blocking some work for a GSOC project to improve the article assessment system on enwiki. If there is nothing blocking any longer, it would be nice to see wgHtml5 re-enabled. I'm just commenting here to bump the bug.

(In reply to comment #15)

This is blocking some work for a GSOC project to improve the article assessment
system on enwiki. If there is nothing blocking any longer, it would be nice to
see wgHtml5 re-enabled. I'm just commenting here to bump the bug.

The best summary of the remaining issues (and a path for re-deployment) can be found here: http://lists.wikimedia.org/pipermail/wikitech-l/2011-June/053775.html.

Really this just needs a sysadmin with the time and patience to shepherd this through. Mark (of Bugmeister fame) should probably assign this to someone.

(In reply to comment #16)

Really this just needs a sysadmin with the time and patience to shepherd this
through. Mark (of Bugmeister fame) should probably assign this to someone.

Message heard! Action being taken!

(In reply to comment #16)

The best summary of the remaining issues (and a path for re-deployment) can be
found here:
http://lists.wikimedia.org/pipermail/wikitech-l/2011-June/053775.html.

Some clarifications on the deployment plan, after talking to Aryeh on IRC:

Stage 1: Set the doctype to HTML 4.01 Strict. This is done by setting $wgDocType = '-//W3C//DTD HTML 4.01//EN'; and $wgDTD = 'http://www.w3.org/TR/html4/strict.dtd'; . Per Aryeh's post this should only cause minor layout issues (category 1 in Aryeh's post).

Stage 2: Once any issues from stage 1 are fixed, set an HTML5 doctype without enabling $wgHtml5. Because the doctype tag is structured differently, you can't use $wgDocType / $wgDTD but you have to live hack it in. In Html::htmlHeader(), change the if ( $wgHtml5 ) test to something like if ( $wgHtml5 || $wgSomethingElse ) or if ( $wgHtml5 || true ) or whatever you like. This may and probably will lead to category 2 breakage.

Stage 3: Once that's working, actually set $wgHtml5 = true; . Category 3 breakage possible.

Once everything has been running smoothly for a couple of days, take out the live hack and the $wgDocType / $wgDTD settings.
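For anyone wondering why stage 2 needs a live hack, here is a small Python sketch of the doctype line a page would carry at each stage. The `doctype_for_stage` function is purely illustrative and is not MediaWiki code; the two doctype strings are the standard HTML 4.01 Strict and HTML5 doctypes.

```python
HTML4_STRICT_PUBLIC_ID = "-//W3C//DTD HTML 4.01//EN"
HTML4_STRICT_DTD = "http://www.w3.org/TR/html4/strict.dtd"


def doctype_for_stage(stage: int) -> str:
    """Return the doctype line a page would carry at each stage of the plan."""
    if stage == 1:
        # Stage 1: a classic doctype built from a public identifier plus a DTD URL.
        return f'<!DOCTYPE html PUBLIC "{HTML4_STRICT_PUBLIC_ID}" "{HTML4_STRICT_DTD}">'
    if stage in (2, 3):
        # Stages 2 and 3: the HTML5 doctype has no identifier or DTD slot to fill.
        return "<!DOCTYPE html>"
    raise ValueError(f"unknown stage: {stage}")


for stage in (1, 2, 3):
    print(stage, doctype_for_stage(stage))
```

The stage 1 line is a template with two slots, which is what a pair of settings like $wgDocType and $wgDTD can fill. The HTML5 doctype has no slots at all, so no combination of those two values produces it.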

(In reply to comment #19)

Is there an ETA for this?

Ideally we want to get this somewhat done before Aryeh is on long term leave (from the internet).

I keep finding enough other stuff to do, so haven't got round to it. Hopefully when I get the Metrics stuff out of the way.

Is it blocking you for something?

(In reply to comment #20)

Is it blocking you for something?

It's blocking my GSoC mentee (Yuvi - porting WP1.0 bot to Mediawiki extension). We're a few weeks away from being ready for deployment on the cluster, so I'm hoping to get a sense of when this'll be resolved for planning/etc. And to make sure this is still on the radar :p

Thanks Reedy!

ayg wrote:

It turns out I'll likely be available to give advice for longer than I thought, at least until mid-October and possibly for months beyond that.

(In reply to comment #22)

It turns out I'll likely be available to give advice for longer than I thought,
at least until mid-October and possibly for months beyond that.

That's useful to know, but we shouldn't leave it to the end either.

I'll see about getting it bumped up the priority list in the near future

could we perhaps enable this on mw wiki to start testing on a more "content"ish wiki compared to just test and test2?

(In reply to comment #24)

could we perhaps enable this on mw wiki to start testing on a more "content"ish
wiki compared to just test and test2?

We could... I'm not sure if we need to go to the effort of 101 mailing list posts, or it's just a JFDI and deal with the issues as they come up

(In reply to comment #25)

(In reply to comment #24)

could we perhaps enable this on mw wiki to start testing on a more "content"ish
wiki compared to just test and test2?

We could... I'm not sure if we need to go to the effort of 101 mailing list posts,
or it's just a JFDI and deal with the issues as they come up

Done.

Step 1 complete

BTW, I did an informal inventory of display issues caused by switching MediaWiki to HTML5 (due to the rendering mode change from semi-quirks to strict in some browsers). The only noticeable issue I saw was the placement of the magnifying glass button in the search field (it displays slightly lower on HTML5 wikis). I imagine other display issues will be present, but that's the only one I could find (testing on 1.17 and 1.18).

(In reply to comment #27)

BTW, I did an informal inventory of display issues caused by switching
MediaWiki to HTML5 (due to the rendering mode change from semi-quirks to strict
in some browsers). The only issue I saw that was noticeable was the placement
of the magnifying glass button in the search field (It displays slightly lower
on HTML5 wikis). I imagine other display issues will be present, but that's the
only one I could find (testing on 1.17 and 1.18).

That sounds like bug 32025 (took me forever to find that >_> )...

Actually, it's bug 30525. I've added it as a dependency.

Can we test this on labs, perhaps straight after the 1.19 deploy?

It's enabled on mediawiki.org now.

Why is this scheduled for the mysterious future?

Well, basically when a bug is accepted and is agreed to be executed it gets a milestone. This one was scheduled for 1.18wmf1 deployment but that didn't happen.

Since then a blocking bug was added, which hasn't been fixed yet. So whatever milestone is set will be useless, since the blocking bug needs to be fixed first.

"Mysterious future" basically means "1.(next) release". Which is postponed until it can be done. But it has been assigned to be in a release. Just not sure yet which one, depending on the dependencies.

(In reply to comment #33)

Since then a blocking bug was added, which hasn't been fixed yet. So whatever
milestone is set will be useless, since the blocking bug needs to be fixed first.

Read the bug; it doesn't really block this getting done. In fact this will kinda fix it (if we are both talking about 34475).

We should just get around to setting a date, giving warning, and then flipping the switch. Tool authors have been warned several times already that this will be happening, and if they still haven't updated, well…

Discussion started on wikitech-l:
See http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/62146

If there are no serious objections, we'll set a date for sometime in July. Exact date will be set here: http://wikitech.wikimedia.org/view/Software_deployments , and should get reflected back to this bug.

(In reply to comment #12)

Several problems on enwiki were caused by the difference in Sanitize::escapeId
between HTML4 and HTML5 modes.

<ref name="foo"> tries to generate a link in the page something like
[1]. In HTML4 mode <ref name="foo [[bar]]"> generates
[1] which functions correctly, but in
HTML5 mode it generates #cite_note-foo_[[bar-0|[1]]] which of course breaks
horribly.

I believe this is bug 27694.

Also, in HTML4 mode <ref name="foo">, <ref name="_foo_">, and <ref
name="''foo''"> are all distinct. In HTML5 mode these are all considered
equivalent.

I'm not sure this is still an issue. Can you take a look at https://test.wikipedia.org/wiki/Cite_anchor_equivalency and confirm?

Are there any other known Cite-related issues?

(In reply to comment #36)

(In reply to comment #12)

I believe this is bug 27694.

It appears someone copied it from an enwiki post based on comment #12 to a new bug.

Also, in HTML4 mode <ref name="foo">, <ref name="_foo_">, and <ref
name="''foo''"> are all distinct. In HTML5 mode these are all considered
equivalent.

I'm not sure this is still an issue. Can you take a look at
https://test.wikipedia.org/wiki/Cite_anchor_equivalency and confirm?

As pointed out in comment #13, it only occurs if $wgExperimentalHtmlIds is true. Does test.wikipedia.org have this true or false?

(In reply to comment #38)

As pointed out in comment #13, it only occurs if $wgExperimentalHtmlIds is
true. Does test.wikipedia.org have this true or false?

It's not mentioned in the config. The default is false.

It's August now, and there do not appear to be any serious objections. Is this going to be deployed soon?

(In reply to comment #39)

It's August now, and there do not appear to be any serious objections. Is
this going to be deployed soon?

*bump* And it's September now :) If there are still concerns or possible problems, maybe first enable it on smaller wikis?

(In reply to comment #40)

(In reply to comment #39)

It's August now, and there do not appear to be any serious objections. Is
this going to be deployed soon?

*bump* And it's September now :) If there are still concerns or possible
problems, maybe first enable it on smaller wikis?

I don't think there's anything stopping this... If you want it enabling on some of the smaller wikis that you can use (and as such possibly help them if they have problems), I don't mind enabling it more widely too

reedy@fenari:/home/wikipedia/common$ mwscript eval.php testwiki

var_dump( $wgExperimentalHtmlIds );

bool(false)

^ I suppose I should set that to true on testwiki, test2wiki and mediawikiwiki for starters

Though:

/**
 * Should we allow a broader set of characters in id attributes, per HTML5? If
 * not, use only HTML 4-compatible IDs. This option is for testing -- when the
 * functionality is ready, it will be on by default with no option.
 *
 * Currently this appears to work fine in all browsers, but it's disabled by
 * default because it normalizes id's a bit too aggressively, breaking preexisting
 * content (particularly Cite). See bug 27733, bug 27694, bug 27474.
 */
$wgExperimentalHtmlIds = false;

(In reply to comment #42)

Should remain false per https://bugzilla.wikimedia.org/show_bug.cgi?id=27694#c4

To be clear, you mean that $wgExperimentalHtmlIds should remain false, not $wgHtml5. :-)

There's a page here for software deployments: http://wikitech.wikimedia.org/view/Software_deployments. I suppose this qualifies. How does this variable ($wgHtml5) get scheduled for a roll-out deployment?

richardg_uk wrote:

Will [[mw:Manual:$wgWellFormedXml]] remain set to true?

"we're scheduling a deployment of HTML5 across the Wikimedia cluster [1]. This is set for Monday 17th September at 18:00-20:00 UTC [2]."

http://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063112.html

@Richard, as far as I know there has been no talk of changing the $wgWellFormedXml configuration. Again, however, anything relying on that probably shouldn't :D