
Auto-categorize pages that contain invalid HTML
Open, Low, Public

Description

Splitting this from T42329 (bug 40329): MediaWiki should, if possible, auto-categorize pages that contain invalid HTML attributes or elements. This will allow diligent and caring wiki editors to improve the code that these pages use, if they feel inclined to.

Details

Reference
bz40633

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:50 AM
bzimport set Reference to bz40633.
bzimport added a subscriber: Unknown Object (MLST).

We first need a system to record any invalid HTML, and I would prefer we do not use categories for that but a special page instead.

I am wondering how we will be able to report that error Foo is happening at line XXX, character YYYY.
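One way to get a line and character position, sketched here with Python's standard `html.parser` rather than MediaWiki's own parser (which this thread does not describe). The tag list and the names `DeprecatedTagLocator` and `locate_deprecated_tags` are hypothetical, chosen only for illustration:

```python
from html.parser import HTMLParser

# Hypothetical list of tags to flag; not taken from MediaWiki.
DEPRECATED_TAGS = {"font", "center", "big", "tt", "strike"}


class DeprecatedTagLocator(HTMLParser):
    """Collect (tag, line, column) for each flagged start tag."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_TAGS:
            # getpos() returns a 1-based line number and a 0-based column.
            line, column = self.getpos()
            self.findings.append((tag, line, column))


def locate_deprecated_tags(text):
    """Return the positions of flagged start tags in the given text."""
    parser = DeprecatedTagLocator()
    parser.feed(text)
    parser.close()
    return parser.findings
```

Note that these positions refer to the text handed to the parser. Mapping them back to the original wikitext after template expansion would be a separate problem that this sketch does not address.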

mr.heat wrote:

Don't make this too complicated. A simple category or special page "ContainsDeprecatedFeature" is enough. Use a generic name so it can be used for everything, including deprecated parser functions and such. Add a way to filter the special page by namespace if you can. Done.

Start with an empty list of deprecated features. Add one feature at a time. I suggest the <font> tag. Let the Wikipedia community know when that specific HTML tag or attribute will be dropped. I suggest something between 3 and 12 months. The community will do the work.

If the special page is almost empty in most Wikipedia languages, add the next feature to the list. I suggest the <center> tag. Then <strike>. Then <big>. Then <tt>. The last ones will be align="..." and valign="..." because they are the most used and therefore require a lot of work to replace. (Note: As explained in bug 40329, it is *NOT* possible to simply replace all align="..." with text-align: ... Replacing such stuff always requires a user to look at the code and to understand what it does. Some replacements can be done with a semi-automatic bot or user scripts. But it should never be done by the MediaWiki software.)
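The staged rollout described above could look roughly like the following sketch. It is a plain regex scan, not MediaWiki code. The names `ROLLOUT`, `active_patterns` and `deprecated_features` are hypothetical, and the third entry assumes that the obsolete `<strike>` element is what was meant. A naive scan like this would also flag tags inside `<nowiki>` or code samples, so it only illustrates the "one feature at a time" idea:

```python
import re

# Hypothetical rollout order following the sequence suggested above.
# "strike" is an assumption about which obsolete element was meant.
ROLLOUT = ["font", "center", "strike", "big", "tt"]


def active_patterns(stage):
    """Compile one pattern per tag for the first `stage` entries of ROLLOUT."""
    return {
        tag: re.compile(rf"<{tag}\b", re.IGNORECASE)
        for tag in ROLLOUT[:stage]
    }


def deprecated_features(wikitext, stage):
    """List the currently tracked tags that occur in the given wikitext."""
    return [
        tag
        for tag, pattern in active_patterns(stage).items()
        if pattern.search(wikitext)
    ]
```

With `stage=1` only `<font>` is reported. Raising the stage once the list is nearly empty adds the next tag without touching the rest of the code.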

We are currently trying to collect possible replacements at http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_HTML5. We would like to start, but we need a way to search for "<font", for example.

My point is that categories should probably not be used as a way to add metadata on articles. Moreover the category table is really huge :/

(In reply to comment #1)

> We first need a system to record any invalid HTML, and I would prefer we do not use categories for that but a special page instead.

How do you envision that working? The benefit to categorization is that you can have "lazy-loading": when a page is reparsed, it can be auto-categorized. How would a Special page work?
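To make the "lazy-loading" idea concrete, here is a minimal sketch using an in-memory stand-in for category membership. Everything in it is hypothetical (the category name, the `CategoryStore` class, the `on_reparse` method and the two-tag pattern); it does not reflect MediaWiki's actual hooks or tables:

```python
import re

# Hypothetical tracking category name and an illustrative subset of tags.
TRACKING_CATEGORY = "Pages using deprecated HTML"
DEPRECATED_TAG = re.compile(r"<(font|center)\b", re.IGNORECASE)


class CategoryStore:
    """In-memory stand-in for category membership: title -> set of categories."""

    def __init__(self):
        self.members = {}

    def on_reparse(self, title, wikitext):
        """Update the tracking category only when a page is (re)parsed."""
        categories = self.members.setdefault(title, set())
        if DEPRECATED_TAG.search(wikitext):
            categories.add(TRACKING_CATEGORY)
        else:
            categories.discard(TRACKING_CATEGORY)

    def pages_in(self, category):
        """Return the sorted titles currently in the given category."""
        return sorted(
            title
            for title, categories in self.members.items()
            if category in categories
        )
```

A page that is never reparsed simply never shows up, which is the trade-off of this approach: no full scan is needed, but the listing is only as fresh as the last parse of each page.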

> I am wondering how we will be able to report that error Foo is happening at line XXX, character YYYY.

I'm not sure that's necessary.

(In reply to comment #3)

> My point is that categories should probably not be used as a way to add metadata on articles.

Umm, can you expand on this point, please? Categories are _classic_ page metadata, aren't they?

> Moreover the category table is really huge :/

And?

MediaWiki-extensions-Linter will create lists of some specific kinds of invalid HTML. And we also have https://en.wikipedia.org/wiki/Category:Pages_using_invalid_self-closed_HTML_tags.

"Invalid HTML" isn't a clear description. Do we want to track anything that isn't adhering to the HTML5 spec? Or what browsers actually implement?

Krinkle removed a subscriber: Krinkle.