
Decoding URL results in Query Service
Closed, Resolved · Public · 3 Estimated Story Points

Description

As a Query Service user I want to have decoded URLs displayed in my query results in order to easily read them.

Problem:
Currently, when running a query on the Query Service (example), some of the displayed URLs are very long and difficult for humans to read.

This ticket is about decoding those URLs to reduce their length and make them more legible to humans.

Example:

Decoding the displayed URL, so that a URL like:

<https://bg.wikiquote.org/wiki/%D0%92%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B0_%E2%80%94_%D0%9A%D0%BE%D1%81%D0%BC%D0%BE%D1%81_%E2%80%94_%D0%A1%D0%B2%D0%B5%D1%82%D0%BE%D0%B2%D0%B5>

turns into

<https://bg.wikiquote.org/wiki/Вселена_—_Космос_—_Светове>

Screenshots/mockups:

Current display of URLs:

[Screenshot: image.png (517×743 px, 80 KB)]

Acceptance criteria:

  • URL results on the Query Service are decoded

Open questions:
  • Are there some special characters we don’t want to decode, e.g. RTL override?
  • Could an automatic <wbr> insertion cause issues?

Original ticket

  1. URLs are fixed sets of characters that won’t wrap on their own. We should apply word-break: break-word; to the QS table so that these long values can break onto subsequent lines based on the available column space. This will prevent the list display from being triggered by the lack of horizontal space in this case.

I would suggest decoding the displayed URL, so that a URL like

<https://bg.wikiquote.org/wiki/%D0%92%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B0_%E2%80%94_%D0%9A%D0%BE%D1%81%D0%BC%D0%BE%D1%81_%E2%80%94_%D0%A1%D0%B2%D0%B5%D1%82%D0%BE%D0%B2%D0%B5>

turns into

<https://bg.wikiquote.org/wiki/Вселена_—_Космос_—_Светове>

(a third of the original length and actually legible to humans) and then use <wbr> at appropriate points in the URL (see https://css-tricks.com/better-line-breaks-for-long-urls/), e.g.

<https:<wbr>//<wbr>bg<wbr>.<wbr>wikiquote<wbr>.<wbr>org<wbr>/<wbr>wiki/<wbr>Вселена<wbr>_<wbr>—<wbr>_<wbr>Космос<wbr>_<wbr>—<wbr>_<wbr>Светове>

Here’s some HTML that you can use to compare the behaviour of the original URL, the original with word-break: break-word, the decoded URL, the decoded URL with word-break: break-word, and the decoded URL with <wbr> (zooming in will make it easier).
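A rough sketch of what that insertion step could look like, assuming we put a <wbr> before each run of URL punctuation (the function name and the exact set of break points are illustrative, not prescribed by the article):

```
function insertWordBreaks( url ) {
	// Put a <wbr> before each run of URL punctuation; the browser may
	// then break the line there without rendering anything visible.
	// Note: the result is HTML, so the URL must already be HTML-escaped.
	return url.replace( /([\/.\-_~?#&=]+)/g, '<wbr>$1' );
}

// insertWordBreaks( 'https://bg.wikiquote.org/wiki/Вселена_—_Космос_—_Светове' )
// → 'https:<wbr>//bg<wbr>.wikiquote<wbr>.org<wbr>/wiki<wbr>/Вселена<wbr>_—<wbr>_Космос<wbr>_—<wbr>_Светове'
```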

Event Timeline

karapayneWMDE set the point value for this task to 3.
karapayneWMDE moved this task from Unified DOT Backlog to Sprint-∞ on the Wikidata Dev Team board.

Task Breakdown Notes:

  • This might be as straightforward as using the built-in JS URL decoder to decode the URL (see the sketch after this list)
  • In order to protect against malicious queries (these should never be in real sitelink URLs), don’t decode (or re-encode after decoding) characters like
    • whitespace characters
    • control characters
    • bidirectional overrides
    • etc. (this could be decided by Unicode category – we probably want to unescape letters and unescape punctuation, but everything else is TBD)
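A minimal sketch of that idea, assuming decodeURI() plus Unicode property escapes (the function name and the exact category choices are illustrative, not the actual patch):

```
// Illustrative sketch, not the actual patch: decode a URI for display,
// then re-encode anything outside the categories we consider safe.
// (HTML escaping of the result is assumed to happen elsewhere.)
function decodeUriForDisplay( uri ) {
	let decoded;
	try {
		decoded = decodeURI( uri ); // the built-in JS URL decoder
	} catch ( e ) {
		return uri; // malformed percent-encoding: show the raw URI
	}
	// Re-encode separators (\p{Z}) and all "Other" categories (\p{C}:
	// control, format, surrogate, private use, unassigned), so a
	// malicious query can't smuggle invisible characters into the output.
	return decoded.replace( /[\p{C}\p{Z}]/gu, ( c ) => encodeURIComponent( c ) );
}
```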

Change 888052 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[wikidata/query/gui@master] Decode URIs for display

https://gerrit.wikimedia.org/r/888052

I think the above change should work, but I’d like to test it a bit more tomorrow before moving the task into peer review.

> • In order to protect against malicious queries (these should never be in real sitelink URLs), don’t decode (or re-encode after decoding) characters like
>   • whitespace characters
>   • control characters

Some of those characters are required by other scripts and therefore do appear in real sitelink URLs. Zero-width joiner and zero-width non-joiner in particular can be relatively common in Arabic and Indic scripts, and select * { ?sitelink schema:isPartOf <https://fa.wikisource.org/> } limit 1000 includes quite a few zero-width non-joiners.

Change 888052 merged by jenkins-bot:

[wikidata/query/gui@master] Decode URIs for display

https://gerrit.wikimedia.org/r/888052

Change 890778 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: WDQSGuiBuilder):

[wikidata/query/gui-deploy@production] Merging from 1f13e68d3c8e39589eef930efdcecef4f68bbd7a

https://gerrit.wikimedia.org/r/890778

Change 890778 merged by Lucas Werkmeister (WMDE):

[wikidata/query/gui-deploy@production] Merging from 1f13e68d3c8e39589eef930efdcecef4f68bbd7a

https://gerrit.wikimedia.org/r/890778

>> • In order to protect against malicious queries (these should never be in real sitelink URLs), don’t decode (or re-encode after decoding) characters like
>>   • whitespace characters
>>   • control characters
>
> Some of those characters are required by other scripts and therefore do appear in real sitelink URLs. Zero-width joiner and zero-width non-joiner in particular can be relatively common in Arabic and Indic scripts, and select * { ?sitelink schema:isPartOf <https://fa.wikisource.org/> } limit 1000 includes quite a few zero-width non-joiners.

Hm, true, this doesn’t look nice :/

[Screenshot: image.png (737×462 px, 121 KB)]

Let me see if there’s a more restrictive Unicode category we can use.

> Let me see if there’s a more restrictive Unicode category we can use.

Not really – ZWJ/ZWNJ are in Other, format (Cf) together with the directional control characters (U+202E RIGHT-TO-LEFT OVERRIDE and friends), which I don’t think we want to allow in decoded form.

MediaWiki core’s MediaWikiTitleCodec::splitTitleString() hard-codes the bidi characters as forbidden: U+200E-F and U+202A-E. I guess we could do the same, and re-encode those seven while allowing the rest of the Cf category? (But still blocking the other “Other” categories: Cc (Other, control), Cs (Other, surrogate), Co (Other, private use), and Cn (Other, not assigned).)

(MediaWiki allows the bidi isolate characters in titles, and indeed U+2066 is a working redirect on enwiki. I’m not sure how I feel about that tbh.)
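If we went that route, a hedged sketch of the per-character check might look like this (the names are illustrative; the forbidden list is the seven MediaWiki characters mentioned above):

```
// Illustrative sketch: allow format characters (Cf) in decoded form,
// except the seven bidi controls that MediaWiki forbids in titles.
const FORBIDDEN_BIDI = /[\u200E\u200F\u202A-\u202E]/;

function isSafeToDecode( character ) {
	if ( FORBIDDEN_BIDI.test( character ) ) {
		return false; // U+200E/F and U+202A-E stay percent-encoded
	}
	// Still block the other "Other" categories and the separators;
	// the rest of Cf would be allowed under this approach.
	return !/[\p{Cc}\p{Cs}\p{Co}\p{Cn}\p{Z}]/u.test( character );
}
```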

> [...] MediaWiki core’s MediaWikiTitleCodec::splitTitleString() hard-codes the bidi characters as forbidden: U+200E-F and U+202A-E. I guess we could do the same, and re-encode those seven while allowing the rest of the Cf category? [...]

Could we maybe go the opposite way and have an allow-list of characters in Cf that we explicitly decode? Then we could start with ZWJ/ZWNJ and add further characters as needed. I imagine that this would feel safer and more understandable to me when reading the code.

> Could we maybe go the opposite way and have an allow-list of characters in Cf that we explicitly decode? Then we could start with ZWJ/ZWNJ and add further characters as needed. I imagine that this would feel safer and more understandable to me when reading the code.

Maybe, though I felt like some of the other characters in that Cf list also looked like they might theoretically be useful (some of them are even printable). But we could also start with ZW(N)J, sure.

I discovered the "all-titles" dumps a few days ago and realised I could use them to find page names containing \p{C} or \p{Z}. (I've put the commands I used in P44829)

There are about 1.1 million page names with those characters:

  • Almost all are \p{Cf}
  • ~21k have unassigned characters (\p{Cn}, list: P44820) (some of these were assigned in Unicode 15; I probably need to upgrade something)
  • ~10k have private-use area characters (\p{Co}, list: P44816)
  • ~4k have control characters (\p{Cc}, list: P44812)
  • There are no \p{Z} or \p{Cs}

Of the \p{Cf} characters:

  • ~1 million have zero-width non-joiners
  • ~30k have zero-width joiners (list: P44807)
  • ~30k have zero-width spaces (list: P44806)
  • ~2k have soft hyphens (list: P44801)
  • ~1k have byte-order marks/zero-width non-breaking spaces (list: P44803)
  • ~1k have word joiners (list: P44802)
  • ~500 have tags (list: P44804)
  • ~500 have other \p{Cf} characters (list: P44822)

Looking at that, I would definitely include zero-width space too.

The tag characters are used for some emoji flags, and we do have some sitelinks in Wikidata which use them (Q65300420, Q100587671), but I think individual tag characters (when not part of a recognised sequence) are not considered printable characters, so it’s probably better not to decode those.

Private-use area characters do appear in some sitelinks (e.g. Q33061193), but whether they display properly depends on whether a compatible font is used, so there’s probably limited benefit to decoding those.

Word joiner characters are the proper way to encode a zero-width non-breaking space (to prevent breaking at that point). I think those would be fine to decode.

Soft hyphens indicate where long words may break and appear in some particularly long page names (e.g. Q101), so not decoding them is counter-productive (e.g. this query’s results wouldn’t scroll horizontally if the soft hyphen were decoded).

Alright, then let’s allowlist ZWJ, ZWNJ and ZWSP.
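In other words, a sketch of the agreed rule (names illustrative; the actual change is the Gerrit patch below):

```
// Illustrative sketch of the agreed rule: after decoding, re-encode
// every \p{C} character except the three allowlisted zero-width ones.
const ALLOWED_INVISIBLES = new Set( [
	'\u200B', // zero-width space (ZWSP)
	'\u200C', // zero-width non-joiner (ZWNJ)
	'\u200D' // zero-width joiner (ZWJ)
] );

function reencodeUnsafeCharacters( decoded ) {
	return decoded.replace( /\p{C}/gu, ( c ) =>
		ALLOWED_INVISIBLES.has( c ) ? c : encodeURIComponent( c )
	);
}
```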

Change 892511 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[wikidata/query/gui@master] Decode ZWJ, ZWNJ and ZWSP for display

https://gerrit.wikimedia.org/r/892511

Change 892511 merged by jenkins-bot:

[wikidata/query/gui@master] Decode ZWJ, ZWNJ and ZWSP for display

https://gerrit.wikimedia.org/r/892511

hoo subscribed.

Moving this to verification, as I suppose this is all we're going to do here in terms of edge case handling.

Thank you so much for your research on this, Nikki!

And all your work digging into this, Lucas, Michael and Marius :)

Change 892900 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query/gui-deploy@production] Merging from b6a19cca3172f192f51d8d9552a7eca073d9b59f

https://gerrit.wikimedia.org/r/892900

Change 892900 merged by Lucas Werkmeister (WMDE):

[wikidata/query/gui-deploy@production] Merging from b6a19cca3172f192f51d8d9552a7eca073d9b59f

https://gerrit.wikimedia.org/r/892900