"pageviews" tool link at bottom creates invalid HTML syntax (due to missing HTML encoding of characters like '&')
Closed, Resolved · Public

Description

One of the great features of the German Wikipedia is that all pages consist of well-formed code and can therefore be evaluated and processed easily with XML technologies. This principle appears to have been broken recently by the inclusion of a changed link to the page-view statistics ("Abrufstatistik"). The page source now contains

<a class="external" href="https://tools.wmflabs.org/pageviews#pages=Kategorie:Staat_als_Thema&project=de.wikipedia.org" rel="nofollow">Abrufstatistik</a>

which means the pages are no longer well-formed and can no longer be parsed. If "&amp;project" or "%26project" were used instead of "&project", everything would be fine again.
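The effect described above can be reproduced with any strict XML parser. The following is only a minimal sketch using Python's standard library; the helper name `is_well_formed` is made up for illustration, and the snippet tests the single anchor element rather than a full page.

```python
import xml.etree.ElementTree as ET

# The anchor element as reported in the page source (bare "&" in the href).
RAW = ('<a class="external" href="https://tools.wmflabs.org/pageviews'
       '#pages=Kategorie:Staat_als_Thema&project=de.wikipedia.org" '
       'rel="nofollow">Abrufstatistik</a>')

# The same element with the ampersand written as an entity reference.
ESCAPED = RAW.replace("&project", "&amp;project")


def is_well_formed(fragment: str) -> bool:
    """Return True if a strict XML parser accepts the fragment (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(RAW))      # bare "&" is rejected by the XML parser
print(is_well_formed(ESCAPED))  # "&amp;" is accepted

# After parsing, the attribute value contains a plain "&" again,
# so the link target itself is unchanged by the escaping.
print(ET.fromstring(ESCAPED).get("href"))
```

The escaping only affects how the URL is written in the markup; the parsed `href` value is the same URL a browser would follow.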

We (e.g. at the University of Cologne) very often use XML technologies to analyze Wikipedia content, both in research and in teaching (right now students are despairing because Wikipedia is no longer well-formed). It would be a very sad step backwards if the well-formedness of the pages were simply abandoned.
See most recently, e.g.: Sahle/Henny, Klios Algorithmen: automatisierte Auswertung von Wikipedia-Inhalten als Faktenbasis und Diskursraum. In: Wikipedia und Geschichtswissenschaft. Ed. by Thomas Wozniak, Uwe Rohwedder and Jürgen Nemitz. München: De Gruyter Oldenburg, 2015, pp. 113-148.

Best regards, Patrick Sahle

PatoLogic updated the task description.
PatoLogic raised the priority of this task from to Needs Triage.
PatoLogic added a subscriber: PatoLogic.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 15 2016, 1:53 PM
Restricted Application added a project: TCB-Team. Feb 15 2016, 2:09 PM

tools.wmflabs.org lists @MusikAnimal as maintainer of pageviews, hence also CC'ing (feel free to remove yourself again).

If I get the (German) description correctly, characters like & are not HTML encoded and hence trigger syntax validation errors on Wikipedia pages, due to embedding this code on Wikipedia pages:
<a class="external" href="https://tools.wmflabs.org/pageviews#pages=Kategorie:Staat_als_Thema&project=de.wikipedia.org" rel="nofollow">Abrufstatistik</a>
instead of e.g.
<a class="external" href="https://tools.wmflabs.org/pageviews#pages=Kategorie:Staat_als_Thema&amp;project=de.wikipedia.org" rel="nofollow">Abrufstatistik</a>

I'm not sure if that link is added by some gadget or some beta function (for both I didn't find any related setting on https://de.wikipedia.org/wiki/Spezial:Einstellungen ) or if that's somehow hardcoded in the MW Core codebase...

(And personally I'm puzzled why this link was added under "footer-info-copyright" but that's another topic.)

Aklapper renamed this task from "Externes tool zur Abrufstatistik zerschießt xhtml-Wohlgeformtheit" (English: "External page-view statistics tool breaks XHTML well-formedness") to "pageviews" tool link at bottom creates invalid HTML syntax (due to missing HTML encoding of characters like '&'). Feb 15 2016, 2:19 PM
  • Sorry for the German; I thought this was a problem relevant only to de.wikipedia.
  • It's surely not hardcoded but must be the result of some automated mechanism, since it occurs on every page as far as I can see.
  • My biggest problem has been, and still is, finding out whom to approach in such a case; I still wonder whether Phabricator is the right place and how to tag the ticket to get it into the right channels.

I understand this is an on-wiki issue on dewiki, and not with the pageviews tool in particular? Could you give an example of a page that contains one of these links?

When linking to external sites you should only need to URL encode parameters passed to them, not the entire URL. I'm only guessing that's what's going on here. E.g. the "page information" correctly links to the pageviews tool, using {{FULLPAGENAMEE}} (encoded page name) in place of where the page name goes, and not encoding the entire URL.
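There are two separate layers of encoding in play here: percent-encoding of the individual parameter values, and HTML escaping of the finished URL when it is written into an `href` attribute. The sketch below shows one way the two steps could be combined; `pageviews_link` is a hypothetical helper written for illustration, not the code that dewiki or the pageviews tool actually uses.

```python
from html import escape
from urllib.parse import quote


def pageviews_link(page: str, project: str, label: str) -> str:
    """Build an anchor element for the pageviews tool (illustrative helper)."""
    # Layer 1: percent-encode only the parameter values, not the whole URL.
    # ":" is kept literal so namespace prefixes such as "Kategorie:" stay readable.
    url = ("https://tools.wmflabs.org/pageviews"
           f"#pages={quote(page, safe=':')}&project={quote(project)}")
    # Layer 2: HTML-escape the complete URL for use inside an attribute,
    # which turns the "&" separator into "&amp;".
    return (f'<a class="external" href="{escape(url, quote=True)}" '
            f'rel="nofollow">{escape(label)}</a>')


print(pageviews_link("Kategorie:Staat_als_Thema", "de.wikipedia.org", "Abrufstatistik"))
```

With this split, an ampersand inside a page title would become `%26` (layer 1), while the separator between parameters would become `&amp;` in the markup (layer 2).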

As I said, to me it seems that "every" page has the link "Abrufstatistik" with the problematic "&project" string in the URL.
Just an arbitrary example: https://de.wikipedia.org/wiki/Extensible_Markup_Language.
On that page there are dozens of URLs that contain parameters whose separators are escaped as entities (&amp;). The Abrufstatistik link is the only one that breaks the rule.

I guess that somebody has changed the way that link is inserted into the pages and has overlooked this tiny problem?

But who can be approached to change this?

Let's first try to reliably reproduce the issue.

As I said, to me it seems that "every" page has the link "Abrufstatistik" with the problematic "&project" string in the URL.
Just an arbitrary example: https://de.wikipedia.org/wiki/Extensible_Markup_Language.
On that page there are dozens of URLs that contain parameters whose separators are escaped as entities (&amp;). The Abrufstatistik link is the only one that breaks the rule.

The "Abrufstatistik" link is working for me, I tried it in Chrome, Firefox and Safari on OSX. What browser/OS are you using? Are there any "gadgets" enabled that might alter external URLs?

You got me wrong: the link "works", but its syntax in the HTML code breaks well-formedness, with the effect that pages from de.wikipedia can no longer be harvested, parsed, analyzed, or reused by XML-related technologies. That's a huge problem for people who use Wikipedia's HTTP access as a sort of API!
I have, for example, students working on course projects who were developing software to analyze de.wikipedia content until last week but cannot continue with that now. Happy are those who work with en.wikipedia ...
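Both observations in this exchange can hold at once: a lenient HTML parser, like a browser, recovers the intended URL from the bare "&", while a strict XML parser rejects the markup. The following sketch contrasts the two using only Python's standard library; `HrefCollector` and `first_xml_error` are names invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

URL = ("https://tools.wmflabs.org/pageviews"
       "#pages=Extensible_Markup_Language&project=de.wikipedia.org")
LINK = f'<a class="external" href="{URL}" rel="nofollow">Abrufstatistik</a>'


class HrefCollector(HTMLParser):
    """Collect href values of <a> elements using the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.hrefs.append(value)


def first_xml_error(markup):
    """Return (line, column) of the first XML error, or None if well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return err.position
    return None


collector = HrefCollector()
collector.feed(LINK)
print(collector.hrefs)        # the HTML parser still yields the working URL
print(first_xml_error(LINK))  # the XML parser reports where it gave up
```

This is why the link is clickable in every browser and yet stops XML-based tooling at the first unescaped ampersand.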

"Let's first try to reliably reproduce the issue." - please take a look at the source code of an arbitrary page in de.wikipedia
It's well formed with only the one exception ...

I'm sorry I do not follow... I inspected the source of https://de.wikipedia.org/wiki/Extensible_Markup_Language, looking at the "Abrufstatistik" link at the bottom. I see https://tools.wmflabs.org/pageviews#pages=Extensible_Markup_Language&project=de.wikipedia.org which is correct.

If you're saying that link has &amp;project= instead of &project= then I'm led to believe there's some sort of script or gadget that's altering the link. I am logged in with default gadgets, no extra scripts. It could also be a plugin or extension of your browser. Just throwing it out there as a possibility. Sorry I am having trouble reproducing the issue!

(And personally I'm puzzled why this link was added under "footer-info-copyright" but that's another topic.)

A (bad) hack from 2012, made as an attempt to add a visible link to stats.grok.se. But people like having the link at the end of the page. Maybe there is a better place for it now?

@MusikAnimal: you reproduced the issue correctly, but you say "I see https://tools.wmflabs.org/pageviews#pages=Extensible_Markup_Language&project=de.wikipedia.org which is correct.". No, it's not correct as regards conformity to the HTML standard and well-formedness. To be well-formed and standard-compliant it has to be either "&amp;" or "%26". I described the far-reaching consequences above.

@Raymond: I am so grateful for that!!!! Right now I don't see it taking effect, however. Maybe it takes some time?

@Raymond: sorry, as of 2016-02-17, 8:43 CET, I DO see the effect of the change, hurray!

This issue may be closed now. Thanks, everybody.

Raymond closed this task as Resolved. Feb 17 2016, 7:58 AM
Raymond claimed this task.
Tobi_WMDE_SW moved this task from Incoming to Done on the TCB-Team board. Mar 17 2016, 2:01 PM