Page MenuHomePhabricator

Parse bare attributes according to the html5 spec
Closed, ResolvedPublic

Description

Currently,

<div class=  style="123">hi</div>

will parse as,

<div class="" style="123">hi</div>

but should yield,

<div class="style=\"123\"">hi</div>

Event Timeline

Arlolra claimed this task.
Arlolra raised the priority of this task from to Medium.
Arlolra updated the task description. (Show Details)
Arlolra added a subscriber: Arlolra.
cscott set Security to None.
cscott moved this task from To Triage to Not ready to announce on the User-notice board.

Change 267206 had a related patch set uploaded (by Arlolra):
T108134: Match html5 unquoted attribute parsing

https://gerrit.wikimedia.org/r/267206

Change 229453 had a related patch set uploaded (by Arlolra):
Match html5 unquoted attribute parsing

https://gerrit.wikimedia.org/r/229453

Change 229453 merged by jenkins-bot:
Match html5 unquoted attribute parsing

https://gerrit.wikimedia.org/r/229453

This will likely go out as part of wmf15 which is scheduled for March 1-3 as per https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap

Change 267206 merged by jenkins-bot:
T108134: Match html5 unquoted attribute parsing

https://gerrit.wikimedia.org/r/267206

I'm writing the next Tech News issue: how can you describe that change for an ignoramus like me? :)

Leaving a dangling equals sign (ex. <div class= test="123">) without providing a value isn't going to result in the parse you were expecting. Either explicitly add an empty quoted string (ex <div class="" test="123">) or use the empty attribute syntax (ex <div class test="123">).

@ssastry: @Tpt announced this on the frwikisource village pump: https://fr.wikisource.org/wiki/Wikisource:Scriptorium/F%C3%A9vrier_2016#Changement_mineur_du_parser_.28technique.29

Although it sounds like there's going to be some pushback, "je ne serais pas surpris si cette modification est très rapidement annulé." (I wouldn't be surprised if it was quickly reverted)

@Arlolra @Tpt https://gist.github.com/subbuss/e4b10c4c36cf09b4842c has the list of 687 pages that need fixing on frwikisource. Could one of you update the village pump posting with this information. I don't speak/write french. :-)

matmarex added a subscriber: matmarex.

Looks done to me?

Hi, if the list of pages

@Arlolra @Tpt https://gist.github.com/subbuss/e4b10c4c36cf09b4842c has the list of 687 pages that need fixing on frwikisource. Could one of you update the village pump posting with this information. I don't speak/write french. :-)

Don't use this list to estimate the extent of page broken by this change, I used it to fix fr.wikisource pages ,then found other pages broken so I end up to pass the bot through the whole ns:0, run is not yet finished, around 4000 pages to fix.

So now some THOUSANDS of pages on it.wikisource and elsewhere are BROKEN, thanks to this change. And nobody even cared to tell us in advance. Great.

@Phe, can you run your bot on the other subdomains as well, and/or give me the script so I can run it myself to speed up things? Thank you.

I'm writing the next Tech News issue: how can you describe that change for an ignoramus like me? :)

@Trizek-WMF. I just saw the tech news announcement. This change is not specific to Parsoid, It affects the core PHP parser as well which is why frwikisource (and now itwikisource which we were not aware of) were affected as well. Could you update the announcement for next week?

So now some THOUSANDS of pages on it.wikisource and elsewhere are BROKEN, thanks to this change. And nobody even cared to tell us in advance. Great.

Our mistake .. we didn't realize this was affecting other wikisource projects as well. We'll update our testing methodology to include test pages from multiple wikisource projects as well to prevent this in the future.

Trizek-WMF raised the priority of this task from Medium to High.

@ssastry, Johan is going to announce that on the next Tech News. It is unfortunate that you didn't notice my message in time. Also, Arlolra message concerning pushing back should have warn you about the impact of that change.

I reopen that task to keep a track. What are we supposed to add to Tech News? I also hope a fix will be provided to Wikisource soon.

I will be happy to add an explanation to Tech News. How would you suggest to phrase it?

So now some THOUSANDS of pages on it.wikisource and elsewhere are BROKEN, thanks to this change. And nobody even cared to tell us in advance. Great.

@Phe, can you run your bot on the other subdomains as well, and/or give me the script so I can run it myself to speed up things? Thank you.

Note the first regex fix 'from= "' to 'from=' and is only here to easier the works of the second regex, regex to use is wiki dependent as some wiki can add their own attributes to <pages

$ python replace.py -lang:it -family:wikisource '-start:!' -regex ' (current|fromsection|tosection|onlysection|to|from|header)=[ ]+"' ' \1="' ' (current|fromsection|tosection|onlysection|to|from|header)= ' ' ' -allowoverlap

You'll need also to patch pywikipedia/pywikibot/textlib.py

diff --git a/pywikibot/textlib.py b/pywikibot/textlib.py
index 41c61ab..2fc0b45 100644
- - - a/pywikibot/textlib.py
+++ b/pywikibot/textlib.py
@@ -234,7 +234,7 @@ def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
             # continue the search on the remaining text
             if allowoverlap:
-                index = match.start() + 1
+                index = match.start() 
             else:
                 index = match.start() + len(replacement)
             markerpos = match.start() + len(replacement)

I will be happy to add an explanation to Tech News. How would you suggest to phrase it?

Bare attributes in tags now parse according to the HTML5 spec. So, <pages from= to= section=1> will parse as <pages from="to=" section=1"> now rather than <pages from="" to="" section="1"> as it used to till now. Please use <pages from="" to="" section=1> or <pages section=1> instead. Based on feedback, this is mostly likely to affect pages on wikisource projects.

Not sure if that is too wordy. But, that covers the issue.

@ssastry, Johan is going to announce that on the next Tech News. It is unfortunate that you didn't notice my message in time. Also, Arlolra message concerning pushing back should have warn you about the impact of that change.

@Trizek-WMF, we were aware of the impact of the change on frwikisource and worked with @Tpt and added announcements and updates to the village pump which @Phe handled with a bot before the fix rolled out. Our oversight was in assuming that this breakage was restricted to frwikisource.

I reopen that task to keep a track. What are we supposed to add to Tech News? I also hope a fix will be provided to Wikisource soon.

Looks like @Phe has already provided the bot script. Thanks very much!

@Trizek-WMF, we were aware of the impact of the change on frwikisource and worked with @Tpt and added announcements and updates to the village pump which @Phe handled with a bot before the fix rolled out. Our oversight was in assuming that this breakage was restricted to frwikisource.

Dangerous bet ^^

@Trizek-WMF, we were aware of the impact of the change on frwikisource and worked with @Tpt and added announcements and updates to the village pump which @Phe handled with a bot before the fix rolled out. Our oversight was in assuming that this breakage was restricted to frwikisource.

Dangerous bet ^^

Indeed. But, note that we did actually run tests on 160K pages from 30 different wikis (and found ~5 pages or so that were affected). What we know now is that wikisource pages were not part of this test set. So, we'll definitely fix that part by including pages from a few wikisource projects (and wikivoyage while we are at it).

Anyway, all I am saying is that were not going fully blind on this. We did run tests and we did anticipate breakage on frwikisource. But yes, we did make a mistake in assuming that this was restricted to one wikisource project. Later today, we'll take a look at which other wikisource projects might be affected.

@Trizek-WMF, we were aware of the impact of the change on frwikisource and worked with @Tpt and added announcements and updates to the village pump which @Phe handled with a bot before the fix rolled out. Our oversight was in assuming that this breakage was restricted to frwikisource.

Dangerous bet ^^

Indeed. But, note that we did actually run tests on 160K pages from 30 different wikis (and found ~5 pages or so that were affected). What we know now is that wikisource pages were not part of this test set. So, we'll definitely fix that part by including pages from a few wikisource projects (and wikivoyage while we are at it).

Anyway, all I am saying is that were not going fully blind on this. We did run tests and we did anticipate breakage on frwikisource. But yes, we did make a mistake in assuming that this was restricted to one wikisource project. Later today, we'll take a look at which other wikisource projects might be affected.

Thank you for that clarification. :)
And my apologies for being a little bit aggressive.

I'll update this with other entries as I generate them.

https://gist.github.com/subbuss/b401a7345a833f73ba37 is the list of 45 pages I found by grepping the latest enwikisource dump.
https://gist.github.com/subbuss/e9b2e58c284d95f75959 is the list of 77 pages I found by grepping the latest eswikisource dump.
https://gist.github.com/subbuss/fb2473b106b86874b610 is the list of 736 pages I found by grepping the latest itwikisource dump

The grep command I used for frwikisource was not as accurate, but I think this new one is better.

Looks like at least for itwikisource, it would be simpler to just run a bot.

<pages from= to= section=1> will parse as <pages from="to=" section=1"> now rather than <pages from="" to="" section="1"> as it used to till now.

Just to be sure: that is that supposed to be section=1" and not section="1"?

<pages from="to=\" section=1">

from is the attribute name
to=" section=1 is the value

Thanks. To clarify: the reason I'm asking is the odd number (3) of quotation marks in <pages from="to=" section=1">.

@Johan .. sorry it is indeed <pages from="to=" section="1"> .. I forgot one quote in my hasty response. Thanks for catching it.

@Arlolra see below:

[subbu@earth bin] echo '<div id= class= title=test>x</div>' | parse.js --normalize
<div id="class=" title="test">x</div>

Verified in the enwiki sandbox as well to confirm.

Wow, I really didn't read that right. Sorry.

This change was the reason for an unexpected rendering of a WP model reported and corrected on the French WP Village Pump, and it was mentioned there that in order to know how much maintenance work this will need on WP, a count of such bare attributes for frwiki like it was done for frwikisource would be appreciated.

This change was the reason for an unexpected rendering of a WP model reported and corrected on the French WP Village Pump, and it was mentioned there that in order to know how much maintenance work this will need on WP, a count of such bare attributes for frwiki like it was done for frwikisource would be appreciated.

We ran tests on 10000 pages from frwiki and zero pages were affected. So, we expect a very small number of pages to be affected.

I just did a quick check on the latest XML-dump for nowiki, found eight pages affected and fixed them. I also scanned frwiki just for fun and found 26 articles, see the list here: http://tools.wmflabs.org/pagecount/articles_with_missing_quotes-frwiki.html

So now some THOUSANDS of pages on it.wikisource and elsewhere are BROKEN, thanks to this change. And nobody even cared to tell us in advance. Great.

@Phe, can you run your bot on the other subdomains as well, and/or give me the script so I can run it myself to speed up things? Thank you.

Note the first regex fix 'from= "' to 'from=' and is only here to easier the works of the second regex, regex to use is wiki dependent as some wiki can add their own attributes to <pages

$ python replace.py -lang:it -family:wikisource '-start:!' -regex ' (current|fromsection|tosection|onlysection|to|from|header)=[ ]+"' ' \1="' ' (current|fromsection|tosection|onlysection|to|from|header)= ' ' ' -allowoverlap

You'll need also to patch pywikipedia/pywikibot/textlib.py

diff --git a/pywikibot/textlib.py b/pywikibot/textlib.py
index 41c61ab..2fc0b45 100644
- - - a/pywikibot/textlib.py
+++ b/pywikibot/textlib.py
@@ -234,7 +234,7 @@ def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
             # continue the search on the remaining text
             if allowoverlap:
-                index = match.start() + 1
+                index = match.start() 
             else:
                 index = match.start() + len(replacement)
             markerpos = match.start() + len(replacement)

@jayvdb for pywikibot

@Stigmj on frwiki we now have to fix in this way templates that may be transcluded on thousands of pages (sorry I wrote model above but meant template = modèle in French), finding all such cases will probably not be easy due to the many syntactic possibilities of templates and modules.

@Stigmj on frwiki we now have to fix in this way templates that may be transcluded on thousands of pages (sorry I wrote model above but meant template = modèle in French), finding all such cases will probably not be easy due to the many syntactic possibilities of templates and modules.

You only have to amend empty values, so the syntax would seem to be less pertinent as a value is empty or not, ie each ...= is changed to ...="" While I am sure that we have broken pages at enWS to fix, I have yet to see any. We seem to have quoted our attributes.

I'm going to cautiously reclose this, now that it has been quiet again for a few days.

Sorry for being a little late and commenting in a closed ticket. I browsed the comments and did not get how i could determined and automatically repair problematic pages. I'd like to do that for dewikisource. Or is this already done for all wikisources? Thanks...