
Provide a way to access unencoded page names for sitelinks
Closed, Resolved · Public

Description

I can't find a way to return the unencoded page name of a sitelink. There are several scenarios where it would be useful:

Including sitelinks in the results currently makes them hard to read. For example, you might create a query like this one, which selects ruwiki and ukwiki sitelinks for cities in Ukraine to find pages that are missing sitelinks. The length of the sitelink URLs makes the results display as lists rather than columns, and since the page name is URL-encoded, you can't tell what the page is. Being able to display the page name Чернобыль in the results instead of <https://ru.wikipedia.org/wiki/%D0%A7%D0%B5%D1%80%D0%BD%D0%BE%D0%B1%D1%8B%D0%BB%D1%8C> would be much more useful here.

Another use would be when you want to map Wikidata items to page names (e.g. if you want to make a list of links for a wiki page from a query). For things like that, you want unencoded page names, not the full URL-encoded URL. There was someone on IRC the other day trying to do something like that.

A third use case I've had: I wanted to compare labels and sitelink titles for some items, to find the ones where the label might need updating, but couldn't do it very effectively because one is URL-encoded and the other isn't.

I could imagine having a new predicate so that you can use something like ?sitelink prefix:pageName ?pagename or having a function for decoding the sitelink so that you can do something like bind(decodefunction(substr(str(?sitelink), 31)) as ?pagename).
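For illustration only, here is a minimal client-side sketch of what such a decoding function would compute, written in Python rather than SPARQL. The helper name `page_name_from_sitelink` is hypothetical, and turning underscores into spaces is an assumption about the desired display form, not something the query service does.

```python
from urllib.parse import unquote, urlsplit


def page_name_from_sitelink(sitelink: str) -> str:
    """Return the readable page title encoded in a sitelink URL.

    Hypothetical helper: it assumes the article path starts with /wiki/.
    """
    path = urlsplit(sitelink).path  # stays percent-encoded, e.g. "/wiki/%D0%A7..."
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not an article URL: {sitelink!r}")
    # Percent-decode as UTF-8, then show underscores as spaces (assumption).
    return unquote(path[len(prefix):]).replace("_", " ")


url = "https://ru.wikipedia.org/wiki/%D0%A7%D0%B5%D1%80%D0%BD%D0%BE%D0%B1%D1%8B%D0%BB%D1%8C"
print(page_name_from_sitelink(url))
```

The `substr(str(?sitelink), 31)` in the suggestion above skips the 30 characters of `https://ru.wikipedia.org/wiki/`, so it only fits wikis whose prefix has that length; splitting on the `/wiki/` path avoids hard-coding the offset.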


Result: schema:name is added

Event Timeline


I would suggest a triple with the predicate schema:name, or perhaps schema:headline.

We can't have unencoded links, because there are rules about which characters can appear in the URL. We can have unencoded strings (strings can have nearly any (sane) character inside) but I'm not sure how you propose to distinguish names from different wikis.

The name/title would be a property of the sitelink, not the item. An item can have multiple sitelinks, but each sitelink ought to have exactly one title.

Hm... that may be possible. We can actually just put rdfs:label on links since they are their own nodes. Will check that.

Workaround:

(REPLACE(REPLACE(REPLACE(strafter(str(?article),"/wiki/"),"%20"," "),"%28","("),"%29",")") as ?title)

;)

@Esc3300 works for English ones, Russian or Chinese ones would be a bit more tricky.
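To make the limitation concrete, the following Python sketch mirrors the nested REPLACE/strafter workaround step by step. The function name `replace_workaround` and the English sample URL are made up for illustration; the Russian URL is the one from the task description.

```python
def replace_workaround(article_url: str) -> str:
    """Python rendering of the REPLACE/strafter workaround quoted above."""
    # strafter(str(?article), "/wiki/")
    title = article_url.partition("/wiki/")[2]
    # The three nested REPLACE calls, innermost first.
    for escaped, plain in (("%20", " "), ("%28", "("), ("%29", ")")):
        title = title.replace(escaped, plain)
    return title


english = "https://en.wikipedia.org/wiki/Mercury%20%28planet%29"
russian = "https://ru.wikipedia.org/wiki/%D0%A7%D0%B5%D1%80%D0%BD%D0%BE%D0%B1%D1%8B%D0%BB%D1%8C"

print(replace_workaround(english))  # readable title
print(replace_workaround(russian))  # still percent-encoded
```

A fixed list of replacements only covers the escapes it names. Non-ASCII titles are encoded as multi-byte UTF-8 sequences, so every possible byte would need its own REPLACE and the bytes would still have to be reassembled into characters.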

Smalyshev triaged this task as Medium priority. Dec 17 2016, 6:29 AM

Related to T131960: should the true page name be the one with underscores or the one with spaces? Right now it's the one with spaces, but maybe some encoding is appropriate. Not sure.

Change 327905 had a related patch set uploaded (by Smalyshev):
[WIP] Add plain-text link name to sitelinks, for easier display.

https://gerrit.wikimedia.org/r/327905

Spaces would be better for all three use cases I listed, so I would prefer spaces.

To compare with labels, spaces would be better.

Shouldn't it be "San Francisco"@en as well? (matching label datatype).

I don’t really like the choice of rdfs:label as predicate. Currently, as far as I’m aware, only items have triples with that predicate, and queries that rely on this assumption might break (and the straightforward fix, ?item a wikibase:Item, isn’t available on WDQS). There’s also the datatype issue that @Esc3300 mentioned, but I don’t think it would be correct to claim that every title on enwiki is in English, either (random examples: Q300 is just some identifier, Sposalizio is Italian, …).

@Smalyshev why do we have to stick to URLs and their requirements? RDF seems to be all about IRIs rather than URLs, and those, I believe, allow Unicode.

> Shouldn't it be "San Francisco"@en as well? (matching label datatype).

Well, the thing is you don't know. You'd say "it is the language of the wiki", but there's no guarantee of that! Consider https://ru.wikipedia.org/wiki/ARPANET - it's a Russian-language wiki, but ARPANET is not a Russian word. There's also https://he.wikipedia.org/wiki/ARPANET. Wiki titles can contain words in a different language. I don't think a language tag there would be of any use, especially if we know it can be wrong. If you need the wiki's language, you have the inLanguage triple, but that does not guarantee the title is actually a word in that language.

> only items have triples with that predicate, and queries that rely on this assumption might break

This is not true, properties have labels too. Can you specify a query that would break? I'd say if a query relies on the assumption that only items have labels, it's already broken. But maybe I am missing some use case, let's see the query. We could use schema:name or something like that.

> why do we have to stick to URL and its requirements

Because otherwise many tools will be unable to consume those. These are not just abstract strings, these actually represent articles in Wikipedia and other wikis. If they are in a form from which you can't get to the article, that defeats the purpose of a sitelink, i.e. a link to a site.
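As a rough illustration of the URL versus IRI distinction, the Python sketch below shows the same article address in both forms. The helper `to_ascii_url` is hypothetical and uses only the standard library; it is not how the sitelink URLs in the RDF export are actually produced.

```python
from urllib.parse import quote, unquote

BASE = "https://ru.wikipedia.org/wiki/"


def to_ascii_url(title: str, base: str = BASE) -> str:
    """Percent-encode a page title (as UTF-8) so the result is an ASCII-only URL."""
    return base + quote(title)


title = "Чернобыль"
iri = BASE + title          # Unicode form, readable but not ASCII
url = to_ascii_url(title)   # percent-encoded form

print(iri)
print(url)
print(unquote(url) == iri)  # decoding the URL gives back the IRI
```

The two forms carry the same information, so a readable title can be exposed as a separate string without changing the link itself.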

> This is not true, properties have labels too.

Okay, but they can have statements too, so they’re fairly similar to items in that regard IMO. I can’t give you a concrete example of a query that would break, but it just feels wrong. I already suggested two other predicates above (schema:name or schema:headline) that would, in my opinion, be more appropriate for a schema:Article (which is the rdf:type of a sitelink).

rdfs:label is used for the Dublin Core title, so it seems suitable for WP article titles.

BTW we do have "ARPANET"@ru at Wikidata: query. So yes, I think we can include it.

Yes, but I'm not sure we can safely claim that every article title on a given wiki is a string in the language of that wiki. It's easy to add it, I'm just not sure it's right to do it.

I changed it to schema:name and added the language. See the updated patch in gerrit.

> Yes, but I'm not sure we can safely claim that every article title on a given wiki is a string in the language of that wiki. It's easy to add it, I'm just not sure it's right to do it.

It's not always technically correct, but a valid assumption that should be right in 99% of the cases, and will produce usable results in like 99.99% or something. Good enough.

Change 327905 merged by jenkins-bot:
Add plain-text link name to sitelinks, for easier display.

https://gerrit.wikimedia.org/r/327905
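
With the patch merged, the first use case from the description could be approached roughly as follows. This is a sketch under assumptions, not a tested query: the item and property IDs (P17, Q212, P31, Q515), the `User-Agent` string, and the helper names `run_query` and `titles` are illustrative choices that do not come from this task.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

# Cities in Ukraine with the plain-text title of their ruwiki sitelink.
QUERY = """
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?ruTitle WHERE {
  ?item wdt:P17 wd:Q212 ;   # country: Ukraine (assumed ID)
        wdt:P31 wd:Q515 .   # instance of: city (assumed ID)
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://ru.wikipedia.org/> ;
            schema:name ?ruTitle .
}
LIMIT 10
"""


def run_query(query: str, endpoint: str = ENDPOINT) -> list:
    """Send the query to the endpoint and return the JSON result bindings."""
    url = endpoint + "?" + urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; replace with one identifying your tool.
    request = Request(url, headers={"User-Agent": "sitelink-name-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def titles(bindings: list) -> list:
    """Extract the plain-text titles from the result bindings."""
    return [row["ruTitle"]["value"] for row in bindings]
```

Calling `titles(run_query(QUERY))` would then return readable page names instead of percent-encoded URLs, provided the endpoint is reachable.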