[Task] investigate https://tools.wmflabs.org/wikidata-externalid-url to see if we can do some of that in Wikibase directly
Closed, ResolvedPublic

Description

Right now some external identifiers are linked via https://tools.wmflabs.org/wikidata-externalid-url. This is not ideal because, for example, we will not have the proper URIs in the RDF dumps.
Investigate if/what/how we can do some of the conversions the tool does in Wikibase directly. The code is at https://github.com/arthurpsmith/wikidata-tools/blob/master/wikidata-externalid-url/index.php

Event Timeline

  1. The IMDB handler ([[https://www.wikidata.org/wiki/Property:P345|P345]]) just maps various ID prefixes to URLs. That's best addressed by splitting the property in question (which has been discussed, but declined before). Various consumers (for example the English Wikipedia) have similar handling for separating the various kinds of IMDB IDs.
  2. [[https://www.wikidata.org/wiki/Property:P213|P213]], [[https://www.wikidata.org/wiki/Property:P919|P919]] and possibly others perform simple string changes which could also be applied to the values saved. In some cases that might not be desirable, though (because the ID representation used for linking might differ from the actual ID).
  3. The cricket archive handler ([[https://www.wikidata.org/wiki/Property:P2698|P2698]]) uses different paths in the URL, based on the value of the ID. This might be solvable by splitting the property or by asking the external site to implement a more generic URL scheme.

In cases like #1 and #3 a step forward could be to get in contact with the target sites and ask for a single, well defined schema URI (or something like that). For other cases changing the saved values or splitting properties might be a way forward.
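To make case #3 concrete, here is a minimal Python sketch of a URL whose path depends on the value of the ID. The rule used (first path segment is the numeric ID integer-divided by 1000) is an assumption inferred from the single example URL https://cricketarchive.com/Archive/Players/41/41464/41464.html; it is not copied from the tool's code, and the function name is hypothetical.

```python
import re
from typing import Optional


def cricketarchive_url(identifier: str) -> Optional[str]:
    """Build a player URL from a purely numeric ID, or return None."""
    # Only plain ASCII digits are accepted; anything else is left unlinked.
    if re.fullmatch(r"[0-9]+", identifier) is None:
        return None
    # ASSUMPTION: the directory is the ID divided by 1000, rounded down
    # (41464 -> 41). This matches the one known example only.
    bucket = int(identifier) // 1000
    return (
        "https://cricketarchive.com/Archive/Players/"
        f"{bucket}/{identifier}/{identifier}.html"
    )
```

A plain `$1` substitution cannot express this, because the URL contains a value derived from the ID rather than the ID itself.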

The code at https://github.com/arthurpsmith/wikidata-tools/blob/master/wikidata-externalid-url/index.php seems fairly trivial compared to disrupting users and current processes by splitting properties.

Is there a problem with specific lines in https://github.com/arthurpsmith/wikidata-tools/blob/master/wikidata-externalid-url/index.php that can't be written in Wikibase?

Well, Wikibase should not know about its content, thus we can't "just" build such transformations into it. I proposed some steps to eliminate such transformations above.

As a last resort, we could have a Wikimedia specific hack for the transformations which we can't do by other means (but it seems to me, we can avoid doing that).

As you noted, solution #1 isn't possible, so we need to find another way. Asking others to set up redirects might be an easy solution, but I'm not sure if using another redirect service is really good practice.

I can understand that you don't want to include code specific to each URL, but between doing this and writing a module that allows freely configurable transformations, there should be a middle way.

One solution could be to offer a list of transformations that can be applied. A qualifier on the formatter url can select it.

  • Qualifier value "1" would be the default: replace $1 with value (As it's currently done).
  • Qualifier value "2": strip whitespace, then replace $1 with value
  • Qualifier value "3": only apply to values starting with a specific string (defined in another qualifier).

Ideally these could also be built in SPARQL when needed.
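The qualifier-selected transformations listed above could be sketched roughly like this in Python. The function name, the `mode` and `prefix` parameters, and the choice to return `None` when a value does not qualify are all assumptions made for illustration; "strip whitespace" is read here as removing every whitespace character.

```python
from typing import Optional


def format_url(template: str, value: str, mode: str = "1",
               prefix: str = "") -> Optional[str]:
    """Apply one predefined transformation, then fill in $1."""
    if mode == "1":
        pass  # default: use the stored value unchanged
    elif mode == "2":
        # remove all whitespace, including spaces inside the value
        value = "".join(value.split())
    elif mode == "3":
        # only link values that start with the configured prefix
        if not value.startswith(prefix):
            return None
    else:
        raise ValueError(f"unknown transformation: {mode!r}")
    return template.replace("$1", value)
```

Because the set of modes is fixed, each one could in principle be mirrored by a string function on the query side; the sketch leaves URL encoding of the value out.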

A solution for [[https://www.wikidata.org/wiki/Property:P213|P213]], [[https://www.wikidata.org/wiki/Property:P919|P919]] and possibly others would be adding capturing groups to the regular expression and using these capturing groups to create the URL.

For [[https://www.wikidata.org/wiki/Property:P919|P919]] this could look as follows:
*regular expression: (\d{2})-(\d{4})
*formatter URL: http://www.bls.gov/soc/2010/soc$2$3.htm
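
A minimal Python sketch of how such a substitution might work. It assumes that $1 keeps its current meaning (the whole value), so the first capturing group becomes $2 and the second $3; that numbering is inferred from the example and is not specified anywhere. The function name is hypothetical.

```python
import re
from typing import Optional


def format_with_groups(pattern: str, template: str, value: str) -> Optional[str]:
    """Fill $1, $2, ... in template from value; None if value does not match."""
    match = re.fullmatch(pattern, value)
    if match is None:
        return None
    # ASSUMPTION: $1 is the whole value, $2 the first capturing group, etc.
    # Groups that did not participate in the match become empty strings.
    substitutions = [value] + [group or "" for group in match.groups()]

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if not 1 <= index <= len(substitutions):
            raise ValueError(f"template refers to missing group ${index}")
        return substitutions[index - 1]

    return re.sub(r"\$([0-9]+)", substitute, template)
```

With the values above, `format_with_groups(r"(\d{2})-(\d{4})", "http://www.bls.gov/soc/2010/soc$2$3.htm", "11-1011")` yields `http://www.bls.gov/soc/2010/soc111011.htm`. The sketch does nothing to bound the cost of a user-supplied pattern.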

We should aim for a solution that solves all these problems. Regular expressions are problematic (security and performance risk, very high complexity) and should be avoided. I would love to see us using Lua snippets for such case-by-case formatting. Using a real programming language is the only solution that will solve all possible problems, and Lua is perfect because it is built for exactly such use cases.

Instead of the formatter URL (P1630) there would be another property with a link to an (edit-protected) Lua module.

The same Lua module could also provide an actual parser and validator in addition to the formatter. Inputs and outputs are always strings and nothing else, but even this would already help a lot. For example: It would be possible to paste a full IMDb URL, and the parser would strip the unwanted stuff. The validator would make sure only real IMDb identifiers are pasted. And the formatter would turn this back into a URL.
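
The parser/validator/formatter trio could look roughly like the following. The proposal is for a Lua module; this sketch is in Python purely to illustrate the three string-in, string-out functions. The identifier pattern (a two-letter prefix followed by at least seven digits) and the prefix-to-path table are assumptions based on public IMDb URL shapes, and all names are hypothetical.

```python
import re
from typing import Optional

# ASSUMPTION: known prefixes and the URL path segment each one maps to.
PATHS = {"nm": "name", "tt": "title", "ch": "character", "co": "company"}
# ASSUMPTION: an identifier is a known prefix plus seven or more digits.
ID_PATTERN = re.compile(r"(nm|tt|ch|co)[0-9]{7,}")


def parse(text: str) -> Optional[str]:
    """Extract an identifier from pasted input such as a full URL."""
    match = ID_PATTERN.search(text)
    return match.group(0) if match else None


def validate(identifier: str) -> bool:
    """Accept only strings that are exactly one identifier."""
    return ID_PATTERN.fullmatch(identifier) is not None


def format_url(identifier: str) -> Optional[str]:
    """Turn a valid identifier back into a URL."""
    if not validate(identifier):
        return None
    return f"https://www.imdb.com/{PATHS[identifier[:2]]}/{identifier}/"
```

Pasting `https://www.imdb.com/title/tt0111161/?ref_=nv_sr_1` into `parse` would give `tt0111161`, which `validate` accepts and `format_url` turns into a clean URL again.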

Yea, "but performance", I hear you. Any other idea that would be as awesome as using Lua?

I think it would be awesome to get some things done ..

Ways to build URL for external IDs

Here are some thoughts on this topic:

  1. Replace $1 with id in URL template like http://domain.com/$1 (current way)
  2. Use external IDs format that will work with current approach (ex.: for https://cricketarchive.com/Archive/Players/41/41464/41464.html external ID should be 41/41464/41464 but not 41464)
    • PRO No need to implement anything
    • CONS A bit more space in DB
    • CONS Decision is up to community
    • CONS Have to migrate data
  3. Splitting the property into multiple properties
    • PRO IMDB template on English Wikipedia has different template for characters, companies, ...
    • CONS It is up to the community to decide (and it seems the proposal was declined)
    • CONS Doesn't cover all the cases (ex.: CricketArchive)
    • CONS Have to migrate templates that already use current property (ex.: IMDB template on German Wikipedia)
  4. Wikibase provides a list of predefined generic transformations, which are then somehow set for each external identifier property individually (for example, as classifiers for the formatter URL property) and applied to the value on render.
    • PRO Easy to use
    • PRO Simple cases will look pretty elegant (like "strip spaces", "remove dashes")
    • CONS Have to be supported by developers
    • CONS Will have some super specific transformations (like "IMDB id transformation" and "CricketArchive ID transformation", because URL generation algorithms are very specific)
    • CONS Probably new data type should be added
  5. Use full URLs as identifiers instead of some string IDs
    • PRO If we treat URL as URI we don't have to do formatting and will have URL for free
    • CONS More space in DB
    • CONS Decision is up to community
    • CONS Have to migrate data
  6. Ask the external site to implement a more generic URL scheme
    • PRO We don't have to do anything except writing mails
    • CONS As soon as we want a stable URL, our request to the site owner would look something like this (in simple words): "Could you please create and maintain one more public interface because our code is not flexible enough and also we don't want to change it". It does not sound like something people would agree to do.
  7. Use regex with capturing groups as qualifier and replace $ variables in formatter URL accordingly (ex.: (\d{2})-(\d{4}) and http://www.bls.gov/soc/2010/soc$2$3.htm)
    • PRO Not so hard to implement
    • PRO Easy to understand how to use
    • PRO Easy to handle simple cases
    • CONS Does not cover all the current cases (ex.: IMDB, HURDAT, ZVG, CricketArchive)
    • CONS performance is undefined and heavily depends on user input
  8. Use Lua code as property's property to define formatter function (as a raw code or a module reference).
    • PRO Performance can be controlled. As soon as we run code in the sandbox we can(?) limit its memory consumption and CPU time
    • PRO Users are more involved in product development and can do much more without developers
    • PRO Some nice features, such as validation and filtering/preformatting on a per-property basis, can be easily implemented once this one is
    • CONS Wikibase starts to depend (maybe optionally) on the Lua PHP extension, which might be hard to install depending on the environment (goodbye, shared hosting).
    • CONS Might be relatively hard to implement
    • CONS Performance depends on user input
    • CONS Probably, new data type should be added

Ways to build URL for external IDs

Here are some thoughts on this topic:

  • Use external IDs format that will work with current approach (ex.: for https://cricketarchive.com/Archive/Players/41/41464/41464.html external ID should be 41/41464/41464 but not 41464)
    • PRO No need to implement anything
    • CONS A bit more space in DB
    • CONS Decision is up to community
    • CONS Have to migrate data

You're not really storing the external ID, but a working value for the actual external ID. So our data isn't the correct data. So this doesn't seem like a true solution.

  • Use full URLs as identifiers instead of some string IDs
    • PRO If we treat URL as URI we don't have to do formatting and will have URL for free
    • CONS More space in DB
    • CONS Decision is up to community
    • CONS Have to migrate data

You're not really storing the external ID, but a working value for the actual external ID. So our data isn't the correct data. So this doesn't seem like a true solution.

  • Ask the external site to implement a more generic URL scheme
    • PRO We don't have to do anything except writing mails
    • CONS As soon as we want a stable URL, our request to the site owner would look something like this (in simple words): "Could you please create and maintain one more public interface because our code is not flexible enough and also we don't want to change it". It does not sound like something people would agree to do.

I agree, this doesn't sound like something other people want to do

  • Use regex with capturing groups as qualifier and replace $ variables in formatter URL accordingly (ex.: (\d{2})-(\d{4}) and http://www.bls.gov/soc/2010/soc$2$3.htm)
    • PRO Not so hard to implement
    • PRO Easy to understand how to use
    • PRO Easy to handle simple cases
    • CONS Does not cover all the current cases (ex.: IMDB, HURDAT, ZVG, CricketArchive)
    • CONS performance is undefined and heavily depends on user input

Since it doesn't cover all cases, it isn't a true solution

Just my 2 cents on some of the possible solutions

[..]

  • 3. Splitting the property into multiple properties

also: * CONS can't cover use cases like ISIL
also: * CONS leads to an indefinite number of properties for basic uses like catalog codes

  • 4. Wikibase provides a list of predefined generic transformations, which are then somehow set for each external identifier property individually (for example, as classifiers for the formatter URL property) and applied to the value on render.
    • CONS Will have some super specific transformations (like "IMDB id transformation" and "CricketArchive ID transformation", because URL generation algorithms are very specific)

Maybe it's just a matter of coding. It doesn't need to be done in the same way as on tools.wmflabs.org/wikidata-externalid-url

  • 7. Use regex with capturing groups as qualifier and replace $ variables in formatter URL accordingly (ex.: (\d{2})-(\d{4}) and http://www.bls.gov/soc/2010/soc$2$3.htm)
    • CONS Does not cover all the current cases (ex.: IMDB, HURDAT, ZVG, CricketArchive)

I'm not really an expert on regex either, but regexes for IMDb were already available before the current external-id solution. Maybe it's just a matter of implementing this correctly.

  • 8. Use Lua code as property's property to define formatter function (as a raw code or a module reference).

also: * CONS Have to be supported by developers

  • 8. Use Lua code as property's property to define formatter function (as a raw code or a module reference).

also: * CONS Have to be supported by developers

Probably I didn't put it the right way:
The idea is to allow "any" user to add and/or modify the Lua code (which will be value of a property or contained in some module), so there won't be any dependency on developers and the code will be supported by the community.
@Esc3300 Or did I miss something?

I am not the best person to review the draft linked above, but I wanted to comment in case it is a general problem: I cannot access this draft.
When you submit a patch as a draft, only people you add as reviewers can open it in Gerrit. That is fine if it is what you intended, but as you put a link here, I started to wonder.

@Aleksey_WMDE : even Lua requires maintenance and support by devs .. currently there is no similar feature

An advantage of using regex is that it could also be reproduced on the query server

Lydia_Pintscher moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.

Ok based on the investigation and discussions: It is possible to handle the special cases. We will introduce a way to make a statement on the property that links to a Lua module that provides the correct link for the given identifier.

Just a followup question:
"We will introduce a way to make a statement on the property that links to a Lua module that provides the correct link for the given identifier."
Does this exist yet or is it in progress?
@Lydia_Pintscher

We didn't have the time for it yet unfortunately.

Change 335832 abandoned by Thiemo Kreuz (WMDE):
Some code that uses LuaCode as formatter

Reason:
The fact this is marked as work in progress as well as the commit message probably mean this was created as a demonstration for T151329. The ticket is resolved. The author is not an employee any more. This patch can still be found from the ticket, in case it is still needed.

https://gerrit.wikimedia.org/r/335832