
kmlexport tool encoding problem
Closed, Resolved (Public)

Description

Hello, the kmlexport tool is broken. For many articles on Czech Wikipedia it produces a broken KML file (with the wrong encoding). Because other map tools consume its output, those tools are broken for the same articles too. You can see the broken result for the following example articles:

You can also look at the Mapa souřadnic (GeoGroup) template on these pages to see what the kmlexport output looks like in derived map tools that fetch their data from it. osm4wiki for OpenStreetMap still shows a result, though with broken encoding, but mapycz for Windymaps and wm-world for Google Maps fail completely (they cannot even show a map, which they do show when given an empty KML file).

Could somebody please help? The tool does not work correctly on 50 % of articles. I asked the author, but he isn't very active and after a month there is still no response. Perhaps he'll answer here. As there is no other tool that can extract all coordinates from an article, fixing this one is crucial for Czech Wikipedia (and maybe some others).

I think the issue could lie in one of two places. Maybe some Toolforge setting related to Perl, the cron job, the web server, article access, or kmlexport changed, so encoding is now handled differently than in the past. Or maybe there is an encoding issue inside the tool's source code, which is published under the WTFPL license (T92963#2265237) (Tools). This is why I added both tags; is that right?

Event Timeline

Dvorapa updated the task description.
Dvorapa added subscribers: Kolossos, rafidaslam, Teslaton.

The tags look OK to me, and you've already added @Para to the CCs, so now we just need them to respond. If @Para does not respond, you also have the option of invoking the abandoned tools policy (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy) and taking up the work yourself.

For wiki projects there is a workaround available, kmlTitle, which lets templates and Lua modules get around the current situation until URL decoding of higher UTF-8 characters has been fixed.

Adding a specific detail: see
https://tools.wmflabs.org/kmlexport?project=de&article=Liste_der_denkmalgesch%25C3%25BCtzten_Objekte_in_Eibiswald
(I could not paste example code here)

It contains an ill-formed href link:

<description><![CDATA[<br>Source: Wikipedia article <a href="https://de.wikipedia.org/wiki/Liste_der_denkmalgeschützten_Objekte_in_Eibiswald#Kruzifix/Kreuz,_"Kreuzigungsgruppe",_bei_Hxxdorf_23">Liste der denkmalgeschützten Objekte in Eibiswald</a>]]></description>

If the title property contains quotes, kmlexport writes the link as above. The first quote inside the title ends the href attribute early, so tools reading this output interpret the link as only

"https://de.wikipedia.org/wiki/Liste_der_denkmalgeschützten_Objekte_in_Eibiswald#Kruzifix/Kreuz,_"

So the request is that quotes inside titles be properly escaped:
<description><![CDATA[<br>Source: Wikipedia article <a href="https://de.wikipedia.org/wiki/Liste_der_denkmalgeschützten_Objekte_in_Eibiswald#Kruzifix/Kreuz,_&quot;Kreuzigungsgruppe&quot;,_bei_Hxxdorf_23">Liste der denkmalgeschützten Objekte in Eibiswald</a>]]></description>
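The escaping requested above can be sketched in a few lines. This is only an illustration in Python, not the tool's actual Perl code: the function name `build_source_link` and its parameters are hypothetical, and it assumes the link is assembled from a base URL, an anchor and a label.

```python
from html import escape


def build_source_link(base_url: str, anchor: str, label: str) -> str:
    """Return an <a> element whose href stays intact even if the anchor has quotes.

    Hypothetical helper for illustration; kmlexport itself is written in Perl.
    """
    # quote=True turns " into &quot; so it cannot terminate the attribute early.
    href = escape(f"{base_url}#{anchor}", quote=True)
    return f'<a href="{href}">{escape(label)}</a>'


link = build_source_link(
    "https://de.wikipedia.org/wiki/Liste_der_denkmalgeschützten_Objekte_in_Eibiswald",
    'Kruzifix/Kreuz,_"Kreuzigungsgruppe",_bei_Hxxdorf_23',
    "Liste der denkmalgeschützten Objekte in Eibiswald",
)
print(link)
```

With this approach the only literal double quotes left in the element are the two that delimit the href value.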

I made a patch, which should fix the issues:


Could someone with access update kmlexport?

bd808 claimed this task.
bd808 subscribed.

I applied @Dvorapa's patch, but the unescaped quote problem persisted. I think I have fixed it by adding a uri_escape_utf8() call on the new line 469 when passing $anchor to printplacemark().

Example:

<description><![CDATA[<br>Source: Wikipedia article <a href="https://de.wikipedia.org/wiki/Liste_der_denkmalgeschützten_Objekte_in_Eibiswald#Kruzifix%2FKreuz%2C_%22Kreuzigungsgruppe%22%2C_bei_H%C3%83%C2%B6rmsdorf_23">Liste der denkmalgeschützten Objekte in Eibiswald</a>]]></description>
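For comparison, here is a rough Python analogue of percent-encoding the anchor, as a sketch only (the actual fix uses Perl's uri_escape_utf8(); `escape_anchor` is a hypothetical name). It also decodes the `H%C3%83%C2%B6rmsdorf` fragment from the example output. Under the assumption that the intended place name is "Hörmsdorf", that fragment is what results when UTF-8 bytes are misread as Latin-1 before escaping, which suggests, but does not prove, that the anchor was already mis-decoded earlier in the pipeline.

```python
from urllib.parse import quote, unquote


def escape_anchor(anchor: str) -> str:
    """Percent-encode an anchor as UTF-8, leaving only unreserved characters."""
    # safe="" also encodes "/" (quote() leaves it alone by default).
    return quote(anchor, safe="")


# Fragment copied from the example output above.
observed = "H%C3%83%C2%B6rmsdorf"

# Decoding once yields mojibake rather than a clean name.
decoded_once = unquote(observed)

# Assumption: the bytes were UTF-8 misread as Latin-1; reverse that step.
repaired = decoded_once.encode("latin-1").decode("utf-8")

# What a single, correct UTF-8 percent-encoding would look like.
clean = escape_anchor(repaired)

# Reproducing the observed value: misread as Latin-1 first, then escape.
double = escape_anchor(repaired.encode("utf-8").decode("latin-1"))

print(decoded_once, repaired, clean, double == observed)
```

If the assumption holds, the quotes are now safely escaped, but the non-ASCII part of the anchor is encoded twice, so the section link may still not resolve to the intended heading.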