Page MenuHomePhabricator

Don't use hacks to replicate a browser function, either let align="" pass through or not; in HTML5
Closed, ResolvedPublic

Description

Author: mr.heat

Description:
Example code, tested in the current 1.20wmf11 in the German Wikipedia which uses HTML5 mode since yesterday.

{| class="wikitable" width="600"

- align="center"
Bug number one, the align attribute is deleted.
-
align="center"

{| class="wikitable" width="300"

Should be centered.
}

But number two, the attributes is replaced with the text-align property. This is '''not''' the same and should not be done.

-
align="center"<center>

{| class="wikitable" width="300"

Is centered.
}

I'm sorry but I find this very funny. The center tag is ''not'' deleted but the align attribute is? WUT?
</center>

}

Don't tell me the align attribute is deprecated and should not be used. I know that. That's not the problem. The problem is this change currently creates a lot of unexpected errors in templates and articles for no reason. Almost no template author was aware of this. I would like to help fixing all templates but I need ''time'' to do that.

All web browsers are able to do the right thing when they get these align attributes. They will do the right thing for years. This will not be dropped. It can't. There is no need to get rid of the attribute right now. Why? As said, we should try to avoid it. But we don't need to do this ''today''.

TL;DR: Allow align attributes in HTML5 mode in the same places they were allowed before.


Version: 1.20.x
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=66413

Details

Reference
bz40329

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:58 AM
bzimport set Reference to bz40329.

See also bug 40306 about align issue with table. Was fixed on live site during last hour.

Allow align attributes in HTML5 mode in the same places they were allowed before

The align attribute as been deprecated in HTML 4 and is no more supported in HTML 5.

I have troubled understand what is the issue reported there. Can you please give us simple test case showing what is being output and what you would expect ?

mr.heat wrote:

I posted a test case above. Everything in this table must be centered similar to the bottom row.

http://de.wikipedia.org/wiki/Wikipedia_Diskussion:Projektneuheiten#Wenn_HTML5_bedeutet

It's not true that "the align attribute is no more supported". *EVERY* web browser is perfectly able to do the right thing even in HTML5 mode. This will not change for many years. There is no need to burn all align attributes.

If it needs to be burned for an "according to the spec" reason we need time to find these attributes in every Wikipedia article and every template and find other solutions. As said and as shown in the example it is *NOT* possible to simply replace every align="center" with style="text-align:center". This is not the same. This will not work in some cases (read, in cases where block elements are involved because text-align is for inline elements only but align is for both).

In some cases we may be able to replace <td align="center"><table> with <td><table style="margin:auto">. But we need time to do this and to test it. I *KNOW* this will not work in some older browsers. This is the reason why we used the align attribute and not margin:auto. In cases where the outer table cell an the inner block element are from different templates it may not be possible at all or at least not so easy. It's definitely not a good solution to hack a {{{style|}}} parameter into every template with this bug.

And again: Why does the <center> tag still work but the align attribute does not work? This does not make sense. Do you really want us to replace every align="center" with <center>? Hell, no. The align="center" is fine. It works. It does it's job where a text-align is not enough.

TL;DR: Either remove <center> or continue supporting the align attribute.

Is the world ending ? Because from your comments it seems like it.

Summary:

1: Align attribute is still support
2: Align attribute was deprecated in HTML4
3: Align attribute is no longer valid in HTML5
4: Current fixDeprecateAttributes() only works for align elements that have inline content and is not fully compatible with block elements
5: We don't rewrite all non-valid elements yet (such as <center>

Solution is something in the order of:
elements-with_align_attribute_center > block element {
margin-left: auto;
margin-right: auto;
}

This might require some class to be added to properly target those children block elements. or we simply need to sanitize the DB contents.

For now, we probably should revert the align sanitation from fixDeprecateAttributes()

I would strongly discourage from keeping the behaviour allowing people to write obsolete (already for ages!) and horrible constructions.

Let's not support wrong constructions which cause inaccessibility and bunch of issues in alternative user agents.

It is wiki, anybody can replace the original bad code of page with new one, very often it's doable by bots simply.

By the way, in past we had the ridiculous class .hiddenStructure. We got rid of it. And people eventually fixed all pages which used that. The same can happen here.

(In reply to comment #0)

The problem is this change currently creates a lot of unexpected errors
in templates and articles for no reason.

Errors? How can different *alignment* be an error? It is just different way of presentation of data, but not error.

Anyway, correct way how to deal with this is to get rid of align attribute and center tag anywhere they appear.

I disagree with that Danny. In my opinion, an INCORRECT 'translation' is worse than outputting invalid HTML5 that works on most browsers.

(In reply to comment #8)

I disagree with that Danny. In my opinion, an INCORRECT 'translation' is worse
than outputting invalid HTML5 that works on most browsers.

Do not translate. Ignore it. "Translating" (= guessing the meaning of incorrect construction by browser) is the hell we've been experiencing in past years. Simply cut it out if it's no longer in specs. That's the only correct way how to deal with it. All other ways just support spreading of incorrect constructions all around thus causing issues elsewhere. It is necessary to correct the original cause (which is wrong code) and not create set of patches over each other which will create issues elsewhere thus require other patches and so on eternally.

mr.heat wrote:

I'm fine with this suggestion. Drop and ignore all align attributes completely.

Either that or continue supporting it.

But remove the current inconsistent "translate to some CSS that may or may not work" hack. As Derk-Jan Hartman said this bad "translation" is worse than everything else.

When it comes to removing/handling deprecated stuff, it's always the same: Some people think it should be removed (The valid ones), other people prefer attempting to handle it (The diligent ones) and some won't understand why to change something that still works (The conservative ones).

If you should decide to remove stuff that is invalid in HTML5, you should care for adequately announcing -- a long time before you start the removal. Otherwise you'll get lots angry reports by the conservative ones and the diligent ones will ask you why you're not "simply" transform to valid HTML5.

mr.heat wrote:

Let me repeat: The current hack is stupid. Either you still support the align attribute or not. There is no "attempt to handle it". "Handling it" is the responsibility of the web browsers. The web browsers know perfectly what to do with these attributes. You are currently doing the job of the web browsers and you are doing it bad. It does not make any sense to remove the attribute and insert some random properties instead that are not the same.

Continue supporting the attribute for a short time.

Tell the people it will be dropped.

Then drop it.

Simple.

(In reply to comment #13)

Let me repeat: The current hack is stupid. Either you still support the align
attribute or not. There is no "attempt to handle it". "Handling it" is the
responsibility of the web browsers. The web browsers know perfectly what to do
with these attributes. You are currently doing the job of the web browsers and
you are doing it bad. It does not make any sense to remove the attribute and
insert some random properties instead that are not the same.

Continue supporting the attribute for a short time.

Tell the people it will be dropped.

Then drop it.

Simple.

Except that kind of stance only applies if you are a HTML based language outputting HTML. We're talking about WikiText. WikiText supports a subset of html markup within it's own syntax. It is not all of html, and it does not work entirely the same as html. Our responsibility is to take whatever WikiText the user inputs and turn it into proper HTML output that tries to match what the user intended as best as we can understand what that is.

For better or worse we supported align, valign, etc... as part of WikiText itself. But outputting those directly creates incorrect output.

Telling thousands of wikis their pages are going to be messed up on upgrade in every single instance they used something that we've supported for years and they need to go through all their content and modify it, including wikis like Wikipedia with several million articles that may need modification. Isn't really much better of an option then telling those wikis that some of their pages are going to be slightly messed up and they'll need to make a few tweaks, but in the mass majority of cases things will still look fine.

I hardly doubt align and valign are part of wikitext. Then you would have to say that id, class, style, title and other common HTML attributes are part of wikitext as well. No, they are not. In fact you can pass various attributes to many elements (either in wikitext or html), even inexisting, and they are simply passed out and Tidy deals with them.

The only attribute handled is style, which does sanitization of external resources.

I totally agree with comment #13 - handling of code is business of the browser, not of the MediaWiki. (In fact it's primarily business of writers of that code anyway...)

Attempting to "translate" attributes to different output will also cause messing of the design, because you never can say how browsers were handling it (and different browsers handle stuff differently). I've seen a lot of various ridiculous constructions which caused "proper" displaying and this "translation" attempt would break it immediately. Those constructions had to be replaced manualy by human, who knows (and can see from context) the dependencies.

Summary: Do not have MediaWiki to deal with it, have browsers to deal with it for the time being. And after some period (a year after announcement seems reasonable) just drop the support of obsolete invalid constructions.

(In reply to comment #15)

I hardly doubt align and valign are part of wikitext. Then you would have to
say that id, class, style, title and other common HTML attributes are part of
wikitext as well. No, they are not. In fact you can pass various attributes to
many elements (either in wikitext or html), even inexisting, and they are
simply passed out and Tidy deals with them.

The only attribute handled is style, which does sanitization of external
resources.

They are all part of WikiText. Every attribute you are permitted to use is part of WikiText. Each one is explicitly allowed. We just define most of them to work the same way as HTML defines them.
Tidy doesn't do anything here. It's not responsible for ANY of the filtering. Attributes permitted inside WikiText are handled by the Parser.php related code inside Sanitizer.php. And Parser.php/Preprocessor*.php parses all the HTML we support. If our parser doesn't parse the markup, then it's not outputted as HTML. So ALL of it is handled by our WikiText parser.
We output br in the same way whether you write <br> or <br />. Whether you write class=foo, class="foo", or class='foo' we output it as class="foo".

And besides the explicit whitelisting of only the attributes we permit style isn't the only attribute that we handle specially.
The id attribute is modified, depending on wiki settings we may replace some characters with dots or in other cases other characters with underscores, and we always decode entities you input into an id.
If html5 output is enabled as a feature. We handle data- attributes specially, and strip them when disabled.
If microdata or rdfa are enabled as features we special case permit them. We also reject inputs that we don't like (ie: they look like an evil uri).
For microdata we have some extra handling. If you don't use itemscope we unset itemtype, itemid, and itemref.
And while we don't have anything that explicitly permits href or src yet, we already have the related code saying that only urls matching our protocol whitelist are allowed.

I totally agree with comment #13 - handling of code is business of the browser,
not of the MediaWiki. (In fact it's primarily business of writers of that code
anyway...)

Attempting to "translate" attributes to different output will also cause
messing of the design, because you never can say how browsers were handling it
(and different browsers handle stuff differently).

If what you're saying is true. Then we have even more reason to do the translation. Our job is to provide output that always works the same. Our job isn't to replicate browser quirks. If browsers behave differently with the same output then we should be outputting something that behaves the same in all browsers even if it's different than how html, etc... says the user input should behave.

I've seen a lot of various
ridiculous constructions which caused "proper" displaying and this
"translation" attempt would break it immediately. Those constructions had to be
replaced manualy by human, who knows (and can see from context) the
dependencies.

Either way, in the end whether we translate or drop support humans manually have to replace this stuff. And manually replacing only the things that are actually broken is simpler than replacing everything. Because the fact is that while there are some cases where the translation doesn't behave the same. For the majority of cases the translation actually works for what the user wanted it to do.

Summary: Do not have MediaWiki to deal with it, have browsers to deal with it
for the time being. And after some period (a year after announcement seems
reasonable) just drop the support of obsolete invalid constructions.

mr.heat wrote:

(In reply to comment #16)

We just define most of them to work the same way as HTML defines them.

No, you do not "define" anything. All you do is whitelisting. Everything is simply passed through or not. MediaWiki never did any translations and never should. This is confusing and bogus.

We output br in the same way whether you write <br> or <br />. Whether you
write class=foo, class="foo", or class='foo' we output it as class="foo".

Fixing slashes and quotes is not the same. class=foo and class="foo" still have the same meaning. The replaced align stuff does *not* have the same meaning.

we always decode entities you input into an id.

Again, non of this changes the *meaning* of the input.

we already have the related code saying that only urls matching our
protocol whitelist are allowed.

Adding or removing something from a whitelist is not the same as *changing* something with something else that has a *different* meaning. As said I'm fine with removing the align attribute from the whitelist.

Our job isn't to replicate browser quirks.

Are you kidding? This is what you currently do. This is what this bug report is about. You are trying to replicate a browser quirk and you are doing it bad. Drop this please.

And manually replacing only the things that are
actually broken is simpler than replacing everything.

What about *not* breaking anything?

(In reply to comment #17)

we always decode entities you input into an id.

Again, non of this changes the *meaning* of the input.

Except you glossed over the part where we go and modify the id itself.
<div id="A+B"></div> -> <div id="A.2BB"></div>
<div id="A B"></div> -> <div id="A_B"></div>

That's not part of html.

we already have the related code saying that only urls matching our
protocol whitelist are allowed.

Adding or removing something from a whitelist is not the same as *changing*
something with something else that has a *different* meaning. As said I'm fine
with removing the align attribute from the whitelist.

Our job isn't to replicate browser quirks.

Are you kidding? This is what you currently do. This is what this bug report is
about. You are trying to replicate a browser quirk and you are doing it bad.
Drop this please.

No that's not a browser quirk. That's a compatibility quirk. They are two completely different things. With the former, two users on different browsers may see two different things even though they are given the same markup. With the latter, two users on different browsers will always see the same thing, even if that output is not completely what they would expect. The later is much more reliably consistent. It's easy to fix something that looks broken right in front of you. While fixing an issue that doesn't look broken when you look at it is really hard.

And manually replacing only the things that are
actually broken is simpler than replacing everything.

What about *not* breaking anything?

You just contradicted yourself. You just said that you were ok with align being removed from the whitelist. Yet that breaks things.
It's still a valid point that translations still work in most situations

mr.heat wrote:

(In reply to comment #18)

two users on different browsers may see two different things

What are you talking about? This is not true. In the previous Wikipedia setup with HTML5 disabled all users got the same HTML output and it looked the same in all web browsers. Now this is broken. It does *not* look the same when a template is tested locally for example. The hack creates a *different* result when the finished code is finally put into a template. The hack *changes* the *meaning* of the code no matter if these attributes were used for a reason or not. This is confusing as hell. It makes creating templates a mess (not that it isn't a mess anyway).

None of your examples ever changed the *meaning* of some code, not even the ID example.

You just contradicted yourself. You just said that you were ok with align
being removed from the whitelist. Yet that breaks things.

I will try to speak very slow: Either remove all align attributes (drop it from the whitelist) or let them pass through (keep it whitelisted). In other words, either break *everything* or *nothing*. The current hack breaks some *random* stuff. This is not only confusing, it's completely unnecessary because every web browser is able to handle the align attributes well. You are replicating something that clearly is the responsibility of the web browser.

While fixing an issue that doesn't look broken when you look at
it is really hard.

Again, this must be a joke. That's exactly the problem of the current hack. It makes broken code *not* look broken. It does not fix anything. It does not help people to fix their outdated template code. It does the *opposite*. It's a stupid counterproductive hack. All it does is adding confusion and breaking some random templates for no good reason.

It's still a valid point that translations still work in most situations

Again, *not* translating this worked in *all* situations till last week. You can't say all the templates we developed in the past years are broken. They are *not*. They were tested in all browsers. Nothing was broken till you started to change our code.

Just FYI: Bug 40306 is handling the problem with the align properties in HTML5 not being translated correctly to CSS. That bug was fixed and changes were already deployed to Wikimedia servers, so the original problem is no longer occurring. Yes, align properties aren't being outputted now, but browsers should render now the tables correctly as if they were still there.

I think that transforming properties to valid HTML5/CSS syntax is fine, when it's done properly and without breaking things. No need to completely drop the align support *now*. The gradually remove support from those elements/attributes is being handled on bug 24529.

Is it really necessary to leave this bug open and continue commenting on it? What is it's current purpose?

(In reply to comment #19)

(In reply to comment #18)

two users on different browsers may see two different things

What are you talking about? This is not true. In the previous Wikipedia setup
with HTML5 disabled all users got the same HTML output and it looked the same
in all web browsers. Now this is broken. It does *not* look the same when a
template is tested locally for example. The hack creates a *different* result
when the finished code is finally put into a template. The hack *changes* the
*meaning* of the code no matter if these attributes were used for a reason or
not. This is confusing as hell. It makes creating templates a mess (not that it
isn't a mess anyway).

None of your examples ever changed the *meaning* of some code, not even the ID
example.

You originally said "because you never can say how browsers were handling it
(and different browsers handle stuff differently)" which seamed to imply the suggestion of a browser quirk where different browsers have different results for the same deprecated markup. That's what this separate sub-discussion was about.

You just contradicted yourself. You just said that you were ok with align
being removed from the whitelist. Yet that breaks things.

I will try to speak very slow: Either remove all align attributes (drop it from
the whitelist) or let them pass through (keep it whitelisted). In other words,
either break *everything* or *nothing*. The current hack breaks some *random*
stuff. This is not only confusing, it's completely unnecessary because every
web browser is able to handle the align attributes well. You are replicating
something that clearly is the responsibility of the web browser.

Browsers support deprecated and removed html so that they can correctly render ancient websites written using HTML 3.2. We are not outputting HTML 3.2, so it's our responsibility to not output stuff that was removed. As the maintainers of WikiText it is also our responsibility to ensure that pages written in WikiText continue to work except when there is a bug we have to fix and we can't fix that bug without breaking a minimal number of pages.
Outputting invalid HTML is a bug. Breaking every article when we are capable of only breaking 5 of every 6. Hence fixing the bug by translating WikiText to use valid HTML that keeps as many articles working as we can is preferable to just breaking everything that relied on the bug.

While fixing an issue that doesn't look broken when you look at
it is really hard.

Again, this must be a joke. That's exactly the problem of the current hack. It
makes broken code *not* look broken. It does not fix anything. It does not help
people to fix their outdated template code. It does the *opposite*. It's a
stupid counterproductive hack. All it does is adding confusion and breaking
some random templates for no good reason.

If the output doesn't look broken to you. The output doesn't rely on browser quirks that would cause it to look broken to another user. And the output is not invalid html that constitutes a bug we need to fix. Then your wiki page is not broken.

ie: If you use align="center" in WikiText, and the translation to style="text-align: center;" in HTML looks ok in your articles. Then there is no reason to stop using align="center". You're writing WikiText not HTML, so whether your WikiText validates under the HTML spec is irrelevant.
When the output is valid. It's only broken if it looks broken.

It's still a valid point that translations still work in most situations

Again, *not* translating this worked in *all* situations till last week. You
can't say all the templates we developed in the past years are broken. They are
*not*. They were tested in all browsers. Nothing was broken till you started to
change our code.

Yes, they were broken. Our code outputted attributes that were deprecated and removed from html into page output. ie: MediaWiki outputted incorrect markup. Which is a bug in the software. And you made templates dependent on browser quirks and MediaWiki outputting bad markup forever... ie: You relied on a bug in the software. So yes, your templates were broken And now we're fixing the bug, repurposing that WikiText to output valid HTML/CSS that is as close as we possibly can to how you wrote templates intending to behave.

mr.heat wrote:

(In reply to comment #20)

when it's done properly and without breaking things.

It is not done properly. It breaks things. This is what this bug is about. See my example in the first comment.

(In reply to comment #21)

You originally said "because you never can say how browsers were handling it
(and different browsers handle stuff differently)"

I have not said that.

it's our responsibility to not output stuff that was removed.

It was not removed. Every web browser in the world is able to render align attributes properly. With your current hack you are doing the job of the we browser and you are doing it bad. There is no need to replicate something that *every* web browser in the world can do by it's own.

it is also our responsibility to ensure that pages written in WikiText
continue to work

Then do this please. Currently an unknown number of WikiText pages do not continue to work.

except when there is a bug we have to fix

What bug? There was no bug with the align properties. It worked fine for decades and it will continue to work for decades.

Breaking every article

What are you talking about? Absolutely nothing will break if you output the align attributes.

It's only broken if it looks broken.

Again, what are you talking about? It *does* look broken. This is what this bug is about.

MediaWiki outputted incorrect markup.

Then drop your support for this markup if you consider it "incorrect". Tell the people it will be dropped, give them some time and then drop it. Either this or continue to support it. But don't change the *meaning* of *my* code.

So yes, your templates were broken

No, they were not. They worked fine and they will work find for the next decades. I tested them in every browser. There was no bug. *You* introduced a bug.

(In reply to comment #22)

(In reply to comment #20)

when it's done properly and without breaking things.

It is not done properly. It breaks things. This is what this bug is about. See
my example in the first comment.

(In reply to comment #21)

You originally said "because you never can say how browsers were handling it
(and different browsers handle stuff differently)"

I have not said that.

it's our responsibility to not output stuff that was removed.

It was not removed. Every web browser in the world is able to render align
attributes properly. With your current hack you are doing the job of the we
browser and you are doing it bad. There is no need to replicate something that
*every* web browser in the world can do by it's own.

We output HTML5. Align was removed from HTML5. Browsers support align to support HTML 3.2. We don't output HTML 3.2. Hence what the browser can do is irrelevant, it's our responsibility to not put align in the page output.

it is also our responsibility to ensure that pages written in WikiText
continue to work

Then do this please. Currently an unknown number of WikiText pages do not
continue to work.

except when there is a bug we have to fix

What bug? There was no bug with the align properties. It worked fine for
decades and it will continue to work for decades.

align is not valid output it's our job not to output it, that is a bug.

Breaking every article

What are you talking about? Absolutely nothing will break if you output the
align attributes.

That was a reference to the suggestion of removing align="" entirely. If we do that then everything breaks instead of just some things.

It's only broken if it looks broken.

Again, what are you talking about? It *does* look broken. This is what this bug
is about.

You said "It makes broken code *not* look broken. It does not fix anything. It does not help people to fix their outdated template code. It does the *opposite*."

I replied saying that if the output does not look broken then the code is not broken.
In other words, I'm saying that if something uses align="" but does not look broken after we translate that to css. Then the template is not broken and there is no need to change the template code.

MediaWiki outputted incorrect markup.

Then drop your support for this markup if you consider it "incorrect". Tell the
people it will be dropped, give them some time and then drop it. Either this or
continue to support it. But don't change the *meaning* of *my* code.

MediaWiki's HTML output is incorrect. That doesn't make WikiText input incorrect.
We are not going to drop something 100% from WikiText just to fix invalid HTML output when we can instead output valid HTML that works 90% of the time.

So yes, your templates were broken

No, they were not. They worked fine and they will work find for the next
decades. I tested them in every browser. There was no bug. *You* introduced a
bug.

Outputting invalid markup is a bug. It doesn't matter if it worked fine, it was a bug.

I would appreciate it if you would stop splitting up my sentences, replying to a chunk of a sentence as if it were a whole senescence, and completely omitting other parts of the same paragraph as if they didn't exist. Frankly it makes it look like you are trying to commit the "Strawman" logical fallacy just to make a point.

mr.heat wrote:

I have no idea who Strawman is.

You say the align attributes need to be removed from the HTML output. I don't see why this needs to be done *now*. Because of a future spec that is not even finished and that no browser in the world is currently able to fulfill? I say the attributes don't need to be removed. They work everywhere. They will work everywhere for the next 10 years or more. I say it's stupid to "fix" something that is *not* broken. Especially if the *only* thing this "fix" does is to actually *break* something.

Basically you say there is no other way than to stick to a unfinished spec no matter if it makes sense or not.

Like it's a law and not a recommendation.

I'm not a lawyer. I'm a coder.

You say you need to remove "invalid" attributes from the output. But at the same time you are accepting the *same* invalid attributes as input? This is confusing. When I use an align attribute in my code I expect it to work according to the spec. Like it did for many years. Now you introduced your own spec where it started to *not* work in some cases. This is confusing as hell. Till last week it was *my* decision what my code did. Now *you* decide that the same code is invalid in some cases but still valid in other cases. You *dropped* a feature. Why? Where is the documentation for this? Where was the "consider changing your templates" warning a year ago? Where is the discussion were the community decided to removed this part of the WikiText feature set?

Why only this part? Why not simply remove all of the invalid stuff? Why not the <center> garbage? This would be way less confusing. A lot of stuff will break but everybody would *know* why. Even people that were not involved in the discussion.

And you *need* to remove all invalid stuff some day. When will this happen? In 10 years? The same time when the browsers will stop to read the align attributes? If this is true why not simply let the browsers decide if they are fine with an align attribute or not? Why do you think it's your responsibility to make this decision? It wasn't your responsibility till last week. All you did was whitelisting. The align attributes were copied to the output or not. Simple. What you are doing now is to *guess* the meaning of these attributes and to replace it with something else that may or may not have the same meaning.

It's not your responsibility to change the meaning of my code.

All the current hack does is to add confusion.

(In reply to comment #24)

I have no idea who Strawman is.

Not who, what. It's a type of logical fallacy.
Here's a pretty good description of it:
http://yourlogicalfallacyis.com/strawman

You say the align attributes need to be removed from the HTML output. I don't
see why this needs to be done *now*. Because of a future spec that is not even
finished and that no browser in the world is currently able to fulfill? I say
the attributes don't need to be removed. They work everywhere. They will work
everywhere for the next 10 years or more. I say it's stupid to "fix" something
that is *not* broken. Especially if the *only* thing this "fix" does is to
actually *break* something.

Basically you say there is no other way than to stick to a unfinished spec no
matter if it makes sense or not.

I think I'll make this a bullet-pointed list:

  • HTML5 (or as WHATWG calls it "HTML") is a living standard. Completion of the standard is irrelevant. Individual components of the spec are separate. Some components are new and in flux, while others are stable and finished standardization. Removal of presentational attributes is one of those. align is no longer a standard attribute, that is finished standardization.
  • The facts people used to misrepresent to say that HTML5 isn't ready are also not valid anymore: http://wiki.whatwg.org/wiki/FAQ#What.27s_this_I_hear_about_2022.3F
  • Additionally align="" has been deprecated since HTML 4.0. Even if you ignore current standards and try to use the previous obsolete standards these still aren't attributes we're supposed to be outputting from our software.
  • We use HTML5. This whole discussion about whether HTML5 is finished or not is pointless. HTML5 is the standard we are using RIGHT NOW to output MediaWiki markup. Whether something is valid or invalid in a spec we are not using is irrelevant.

Like it's a law and not a recommendation.

I'm not a lawyer. I'm a coder.

You say you need to remove "invalid" attributes from the output. But at the
same time you are accepting the *same* invalid attributes as input? This is
confusing. When I use an align attribute in my code I expect it to work
according to the spec. Like it did for many years. Now you introduced your own
spec where it started to *not* work in some cases. This is confusing as hell.
Till last week it was *my* decision what my code did. Now *you* decide that the
same code is invalid in some cases but still valid in other cases.

WikiText is not HTML, that's a simple fact you'll have to understand. HTML is consumed by browsers and it's format is defined by the HTML standard. While WikiText is consumed (and written) by the MediaWiki parser and human users who do not follow standards. For a user trying to insert a centered table in the middle of the page {| align=center makes perfect sense to create a centered table even though is the wrong way to do it in HTML (which they probably won't know). So it makes perfect sense to take that as an indication to create a centered table. Which in CSS means to apply style="margin-left: 0; margin-right: 0;" as we do now.

You *dropped* a feature. Why? Where is the documentation for this?
Where was the "consider changing your templates" warning a year ago?
Where is the discussion were the community decided to removed this part of the WikiText feature set?

As with everything we change 1.19's RELEASE-NOTES included the introduction of presentational attribute sanitization.
At the time of introduction neither the <table> nor nested block quirks were known so there was no need for any mass notification about incompatibilities since it was supposed to be compatible.
The table float bug was caused by the relevant info being buried in a separate section of the spec. After the table float bug was discovered someone tried fixing it. Which caused bug 40306 since the committer made a mistake and made that special case apply to more than the one tag it applies to. Then recently that bug got fixed and deployed. Now the one single remaining issue is the nested block align quirk.
;) Btw, here's a fun fact. HTML4 does not even appear to specify that behaviour. In fact it has examples using text-align as a replacement for align.

It's been discussed on the mailing list multiple times. But we don't go asking for community approval for each and every single change we make for the software. It's ridiculously unscalable. Nothing would get done.

Why only this part? Why not simply remove all of the invalid stuff? Why not the
<center> garbage? This would be way less confusing. A lot of stuff will break
but everybody would *know* why. Even people that were not involved in the
discussion.

Right now the Sanitizer only supports attribute sanitization. We don't have a way to sanitize whole elements yet. To do that we need to add a new sanitizer api for whole elements, deprecate the old one, and make some (potentially non-trivial) changes to the WikiText parser. So we started with sanitization of what we can do.
So basically sanitization of the rest of the stuff that was removed from HTML will happen when someone gets around to it. I added bug 40579 so you can watch that if you want to know when that gets fixed.

And you *need* to remove all invalid stuff some day. When will this happen? In
10 years? The same time when the browsers will stop to read the align
attributes? If this is true why not simply let the browsers decide if they are
fine with an align attribute or not?

Someone should get around to cleaning up the rest of the invalid stuff within a few releases. It's probably not going to take even 2 years, much less 10.

Why do you think it's your responsibility
to make this decision? It wasn't your responsibility till last week. All you
did was whitelisting. The align attributes were copied to the output or not.
Simple. What you are doing now is to *guess* the meaning of these attributes
and to replace it with something else that may or may not have the same
meaning.

It was our responsibility to do this since far before last week. The fact that we didn't wasn't an assertion that it wasn't our responsibility. We just got around to trying to fix a bug that needed to be fixed.

It's not your responsibility to change the meaning of my code.

All the current hack does is to add confusion.

I've mentioned it multiple times already. It's our responsibility to both make sure MediaWiki outputs valid markup and make sure that when we change the handling of WikiText we don't just break syntax itself that used to work entirely without good reason. In this situation that means transforming WikiText attributes like align into css so that the same WikiText still works for most use-cases. In this case it means transforming align="center" to style="text-align: center;" on everything except <table>. The CSS does not center nested block elements, but most instances of align are used to center inline text. So this allows valid markup that still works in most cases but may need a few tweaks to WikiText in cases where a block element was relied on (ie: probably adding auto-margins to it).

mr.heat wrote:

(In reply to comment #25)

We use HTML5.

No, you don't. You still output out-dated garbage like <center> and <font>. I consider a single <center> or <font> tag a *lot* more dangerous than thousands of align attributes.

I wouldn't be here if your decision would make sense. But it does not. As long as you do *not* output valid HTML5 all of your arguments are irrelevant.

As long as you output <center> tags there is no reason to not output align attributes.

Simple.

Dropping a few random snippets for being "invalid" and breaking them the same time is just lazy. All it does is adding confusion. I think I said that multiple times now.

WikiText is not HTML

No, that's not true and you know it. Things like <small> tags or class attributes are simply whitelisted in the MediaWiki parser. When I write something like <small> it actually *is* HTML. When I wrote align="center" in a template it actually was HTML till last week.

Now it's not HTML any more. Instead it has a special meaning in WikiText. Since last week it became *different* from what it was the week before.

What you did was simply dropping a feature. Instead you are outputting something that may or may not be intended by the template developer.

Dropping a feature requires community consensus.

(In reply to comment #26)

(In reply to comment #25)

We use HTML5.

No, you don't. You still output out-dated garbage like <center> and <font>. I
consider a single <center> or <font> tag a *lot* more dangerous than thousands
of align attributes.

I wouldn't be here if your decision would make sense. But it does not. As long
as you do *not* output valid HTML5 all of your arguments are irrelevant.

As long as you output <center> tags there is no reason to not output align
attributes.

Those are bugs, and they should be fixed. The project of fixing invalid markup not being finished is NOT a valid reason to say that invalid markup should not be fixed. That's circular reasoning. When you have two bugs and you fix one. You work towards fixing the other. You don't re-introduce the first bug because the second one has not been fixed.
Frankly if I had the time, I'd just go and fix that bug right now.

And yes we do use HTML5. We output HTML5 doctypes. We follow HTML5 markup rules. We support HTML5 elements, attributes, and features. And we have an open tracking bug tracking issues with our HTML5 output that should be fixed. Just because we have a bug or two that needs fixing does not mean we're not using HTML5. Rather the fact that we actually consider those bugs is an indication HTML5 is our target.

Simple.

Dropping a few random snippets for being "invalid" and breaking them the same
time is just lazy. All it does is adding confusion. I think I said that
multiple times now.

Then instead of crusading with ridiculous arguments like saying we're not using HTML5 and HTML5 is unfinished and shouldn't be used try pushing for the non-lazy way to fix invalid html.

There have been different ideas on the mailing list. Some ideas how we might fix the block alignment. Others on ways we might eliminate use of deprecated attributes in the long run. The discussion on how to deal with invalid markup in the long run doesn't even appear to have closed.

WikiText is not HTML

No, that's not true and you know it. Things like <small> tags or class
attributes are simply whitelisted in the MediaWiki parser. When I write
something like <small> it actually *is* HTML. When I wrote align="center" in a
template it actually was HTML till last week.

Now it's not HTML any more. Instead it has a special meaning in WikiText. Since
last week it became *different* from what it was the week before.

WikiText is not HTML. WikiText is a loose page authoring syntax. We take in text that people write using simple patterns and try to format the text using the best output HTML we can. Starting a line with * is a good way to make an unordered list, so we turn those into <ul><li>'s. [[Foo]] is a good way to make a link so we turn those into <a>. Likewise {| makes a good syntax for tables.
Some people want to output certain HTML elements into their pages. Such as a HTML <div>. And others want to output a <ul> in ways we can't do with * syntax. We can't really cover all the ways to do something using some other syntax. So for the WikiText we support to output those HTML elements we support a subset of the same syntax HTML uses to output them. So while it's a simila format, that's WikiText, not HTML.

What you did was simply dropping a feature. Instead you are outputting
something that may or may not be intended by the template developer.

Dropping a feature requires community consensus.

As I mentioned before from the very start this was supposed to be a working translation from dropped html to css. The fact the visual look of the output would change was not known. Because it did not change when the feature was first tested. It only changed in edge cases that were not known. Invalid markup in the output is not a feature. There was no feature being dropped.

The issues only came out later as bugs. And not just later, they were not pointed out until after 1.19 was already released. Even though this change should have already been live on the various Wikipedias and all their thousands of templates with syntaxes all over the place.


A few things you may have missed:

  • CSS translation functionality is controlled by $wgCleanupPresentationalAttributes. It can be controlled per-wiki.
  • And we actually in fact have had people requesting this feature. So this feature definitely isn't going away entirely.
  • MediaWiki defaults are not based on WMF setup. ie: We have settings where we add them. Set one default in MediaWiki, and use a different value on WMF's wikis.
  • What you should really be pushing for is not removal of this. But setting $wgCleanupPresentationalAttributes either as a mw default (if 3rd party wikis are your concern) or as a site request (if this is affecting WMF wiki). Though if you're going to do the latter please open a separate bug since this one is a little to generically technical.
  • How to handle invalid markup both short-term and in the long run is a wikitech-l list discussion. It hasn't finished. (Hence why this bug is left in a ephemeral state instead of just being WONTFIXED right now)

Side note. I'm considering changing that setting a little to a 3-state switch instead of a boolean. ie: "Pass invalid markup to the browser." (current false), "Transform to valid markup." (current true), or "Drop invalid markup." (a new option). That way individual wikis also have the option to drop bad markup in WikiText entirely.

Another option might be to set the default to drop invalid markup entirely and get existing wikis to set it differently. So we can get future wikis to just not use the invalid markup entirely while old wikis have compatibility and can transition. Though that idea should be discussed on the mailing list.

(In reply to comment #27)

Then instead of crusading with ridiculous arguments like saying we're not using
HTML5 and HTML5 is unfinished and shouldn't be used try pushing for the
non-lazy way to fix invalid html.

I think this is fair. I'm inclined to file two bugs here:

(1) Kill $wgCleanupPresentationalAttributes from MediaWiki core; "magical" transformations like this are almost always an absolutely terrible idea and should only be done when absolutely necessary (such as stripping out dangerous attributes); you're never going to catch every (edge) case and you're going to do more harm than good (and create confusion about what's magically transformed and under what conditions); I thought we established these principles with magical transformations years ago.

(2) Add a tracking category for pages using invalid HTML5; this should be straightforward enough; allow wiki authors to slowly fix these pages to use better code by auto-categorizing pages that use invalid HTML5.

This bug can then rot, as far as I'm concerned.

(In reply to comment #28)

(In reply to comment #27)

Then instead of crusading with ridiculous arguments like saying we're not using
HTML5 and HTML5 is unfinished and shouldn't be used try pushing for the
non-lazy way to fix invalid html.

I think this is fair. I'm inclined to file two bugs here:

(1) Kill $wgCleanupPresentationalAttributes from MediaWiki core; "magical"
transformations like this are almost always an absolutely terrible idea and
should only be done when absolutely necessary (such as stripping out dangerous
attributes); you're never going to catch every (edge) case and you're going to
do more harm than good (and create confusion about what's magically transformed
and under what conditions); I thought we established these principles with
magical transformations years ago.

What do you think of a new setting to disable all these invalid tags and attributes. Defaulting it on and having WMF and existing wikis turn it off. So that on new wikis it won't even be supported from the start?

(2) Add a tracking category for pages using invalid HTML5; this should be
straightforward enough; allow wiki authors to slowly fix these pages to use
better code by auto-categorizing pages that use invalid HTML5.

This bug can then rot, as far as I'm concerned.

content hidden as private in Bugzilla

Okay, lets try to get this thing back on track and work out some realistic
goals. 1.20 release is coming up soon.

  • $wgCleanupPresentationalAttributes will be removed from core (master and

REL1_20), it is currently broken and to be considered an unacceptable
regression.

  • As for the future (whether or not working on improving it and bringing it

back). Changing the output of custom wikitext markup (e.g. changing the
wikitext table syntax {| output to create <thead> sections) is one thing.
Though that is still something to be very cautious about, it can be useful.
However changing the output of simple HTML is quite another.

Changing that must never exceed the boundaries of security and normalization.
Anything else is imho by definition a bad idea. Trying to extract the meaning
of html attributes in an automated fashion to try and update it is a lost
cause. Not only will it cause confusion (do we support it or not? And if so,
why are we fixing unsupported stuff? Are we going to auto-migrate everything
that has been be deprecated in some W3C HTML specification? Then we'll have
to migrate HTML3.2 bgcolor="" as well (deprecated in HTML4.01).

Aside from being confusing, it can also cause bugs because we're changing
markup. That means that certain CSS selectors may no longer apply (`.foo center
.whatever { color: pink; }`). Or certain JavaScript modules will start failing
in weird unexpected ways (firstChild.nextNode.nodeName.toLowerCase() ===
'center'), or the layout will change due to the difference in weight of css rules.
Weight of inline styles vs. deprecated attributes vs. placeholder classes we
may substitute. As well as in which order they would merge if we try to fix them.
And the list of possible problems goes on.

Now if there was actually some kind of victory at the end of this quest, it
could be worth pursuing. But surely we don't consider passing the HTML5
validator a victory? All browsers we support support these attributes, they
likely do so because old documents (and old habits) die hard and they will be
served through newer software frameworks. This all works just fine and there is
nothing to be concerned about. The validator is a tool, no more no less. It is
not a browser and we don't have to support it everything it says.

(In reply to comment #31)

  • $wgCleanupPresentationalAttributes will be removed from core (master and

REL1_20), it is currently broken and to be considered an unacceptable
regression.

I've split this out to bug 40632 ("Kill $wgCleanupPresentationalAttributes from MediaWiki core").

Now if there was actually some kind of victory at the end of this quest, it
could be worth pursuing. But surely we don't consider passing the HTML5
validator a victory? All browsers we support support these attributes, they
likely do so because old documents (and old habits) die hard and they will be
served through newer software frameworks. This all works just fine and there is
nothing to be concerned about. The validator is a tool, no more no less. It is
not a browser and we don't have to support it everything it says.

I've split this out to bug 40633 ("Auto-categorize pages that contain invalid HTML"). I believe, if possible, auto-categorizing pages that contain invalid HTML (attributes or elements) would be beneficial to wiki editors. Knowledge is power, or something.

Lowering priority of this bug in favour of bug 40632.

Feature removed (bug 40632) in Change-Id: I4e86305520a3b22ef88381caab55d24abac932e3.

Landed in master and merged to 1.20.x.