
Share complex SPARQL queries in Wikidata Query Service via short URL [investigation]
Open, Needs Triage, Public

Description

Main components:

  • Wikidata Query Service UI

User story:
As a user of the Wikidata Query Service,
I want to share complex queries via a short URL
in order to share my work with others (e.g. via Twitter).

Problem:

  • Currently the maximum size for the URL to be shortened on the URL shortener is 2000 characters. This is not enough for complex Wikidata queries. It gets even worse if Unicode characters are involved.
  • Ideally, we want to encourage well-documented queries; a limit on query length discourages this.
  • The suggested T220703: Increase the max length of URL to be shortened was declined for security reasons, so we need to find a different solution.
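To illustrate the Unicode point: percent-encoding expands each non-ASCII character into several characters (one %XX escape per UTF-8 byte), so queries containing non-ASCII labels hit the limit even sooner. A minimal Python sketch:

```python
from urllib.parse import quote

# An ASCII character survives percent-encoding unchanged...
print(quote("a"))    # a
# ...but a single accented character becomes six characters,
# because each of its two UTF-8 bytes is escaped as %XX.
print(quote("é"))    # %C3%A9
```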

Example:

Solution:
Ideas for solutions:

  • The current URL-encoded strings are highly redundant. This means that a compression algorithm optimized for WDQS / SPARQL syntax (e.g. based on Unishox or Brotli) might let users create complex, well-documented queries that still fit into the 2000-character limit of the URL shortener.
  • Alternative mechanism for shortening URLs (e.g. T220703#7499438)
  • ...
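As a rough illustration of the compression idea, here is a Python sketch using the stdlib's zlib (DEFLATE) as a stand-in for Unishox or Brotli, with URL-safe base64 so the result can live in a URL; the actual savings depend heavily on the query:

```python
import base64
import urllib.parse
import zlib

query = (
    "SELECT ?item ?itemLabel WHERE {\n"
    "  ?item wdt:P31 wd:Q146 .\n"
    '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }\n'
    "}"
)

# What the WDQS UI puts into the URL today: plain percent-encoding.
url_encoded = urllib.parse.quote(query)

# Alternative: compress first, then wrap the bytes in URL-safe base64.
token = base64.urlsafe_b64encode(zlib.compress(query.encode(), 9)).decode()

print(len(url_encoded), len(token))

# The token decodes losslessly back to the original query.
restored = zlib.decompress(base64.urlsafe_b64decode(token)).decode()
assert restored == query
```

For short queries the base64 overhead eats most of the gain; the approach only starts to pay off on long, redundant queries, which is exactly the case this task cares about.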

Acceptance criteria:

  • investigation for now

Open questions:

  • What other solutions can you think of?
  • What compression algorithm would be best suited for this purpose?
    • And which of those have browser-JavaScript implementations?

Event Timeline

Manuel renamed this task from "Share complex SPARQL queries in Wikidata Query Service via a short URL" to "Share complex SPARQL queries in Wikidata Query Service via short URL [investigation]". Nov 12 2021, 2:33 PM
Manuel updated the task description.

We have an as-yet unused "Query:" namespace; perhaps it could be opened for storing queries longer than 2000 characters? (A redirection mechanism such that a visit to any page in that namespace takes you to query.wikidata.org might be useful, and there ought to be a system, similar to what is done with CSS pages, that ensures that a query is syntactically correct before it can be saved to that namespace.)

What other solutions can you think of?

Building on what was suggested in T220703#7499438: Why not do that in the browser?
A small piece of JS code in the Query Service GUI: when provided with a page title (and optionally a revision ID), it requests the page content via the API, and if it finds a {{SPARQL|query=...}} template, it redirects to that query.

I'm probably missing something, because to me that sounds pretty straightforward.

We have an as-yet unused "Query:" namespace; perhaps it could be opened for storing queries longer than 2000 characters? (A redirection mechanism such that a visit to any page in that namespace takes you to query.wikidata.org might be useful, and there ought to be a system, similar to what is done with CSS pages, that ensures that a query is syntactically correct before it can be saved to that namespace.)

Mh, that could be an option. We would maybe move it out of Wikibase itself and create a new extension similar to EntitySchema? The effort would probably be similar to that of creating the EntitySchema extension. Also, IIRC, creating the EntitySchema extension had some aftermath, with the community requesting more functionality or integration. I wasn't involved in that anymore, but we should probably try to anticipate such requests if we want to go that route.

Edit: Though probably with a dedicated new namespace.

What compression algorithm would be best suited for this purpose?

And which of those have browser-JavaScript implementations? One possible example would be https://github.com/pieroxy/lz-string/

The trouble with compression is that we can never change it again. Since old queries will have been stored in some compressed format, that compression algorithm must be supported forever or they will become unreadable. That isn't great...

Manuel updated the task description.

The trouble with compression is that we can never change it again.

I see. Maybe some very simple dictionary approach could work? The syntax is highly redundant so that might go a long way. It would also leave the resulting string still interpretable. Not the perfect solution anyways.
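A minimal sketch of such a dictionary scheme (the token table and the "§" marker character are made up; a real scheme would have to guarantee that markers cannot collide with query content):

```python
# Hypothetical substitution table: frequent SPARQL constructs -> short tokens.
# "§" is an arbitrary marker assumed never to occur in queries themselves;
# a real scheme would need to escape or forbid it.
DICTIONARY = {
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }': "§L",
    "SELECT": "§S",
    "OPTIONAL": "§O",
}

def shrink(query: str) -> str:
    for phrase, token in DICTIONARY.items():
        query = query.replace(phrase, token)
    return query

def expand(query: str) -> str:
    for phrase, token in DICTIONARY.items():
        query = query.replace(token, phrase)
    return query
```

Unlike binary compression, the shrunk string stays mostly human-readable, which is the "still interpretable" property mentioned above.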

The syntax isn't that redundant. E.g., consider the first example linked above:

#defaultView:Map
SELECT ?s ?sLabel ?coor ?operator ?operatorLabel ?image ?layer WHERE {

  ?s wdt:P31/wdt:P279* wd:Q28564 ;
     wdt:P17 wd:Q145 ;
     wdt:P625 ?coor ;
     wdt:P137 ?operator .
  OPTIONAL { ?s wdt:P18 ?image . }

  VALUES ?operator { wd:Q4923796 wd:Q4966533 wd:Q5016926 wd:Q5038400 wd:Q5043224 wd:Q5064127 wd:Q5166758 wd:Q5256629 wd:Q16837157
                     wd:Q5623821 wd:Q6083890 wd:Q16997658 wd:Q6901162 wd:Q6984500 wd:Q16998902 wd:Q7161994 wd:Q7236943 wd:Q7321391
                     wd:Q5123523 wd:Q7825688 wd:Q7909538 wd:Q8038115 }
BIND( 
       IF(?operator = wd:Q4923796, "Blaenau Gwent", 
       IF(?operator = wd:Q4966533, "Bridgend", 
       IF(?operator = wd:Q5016926, "Caerphilly",  
       IF(?operator = wd:Q5038400, "Cardiff",   
       IF(?operator = wd:Q5043224, "Carmarthenshire", 
       IF(?operator = wd:Q5064127, "Ceredigion",   
       IF(?operator = wd:Q5166758, "Conwy",
       IF(?operator = wd:Q5256629, "Denbighshire",
       IF(?operator = wd:Q16837157, "Flintshire", 
       IF(?operator = wd:Q5623821, "Gwynedd",   
       IF(?operator = wd:Q6083890, "Isle of Anglesey",
       IF(?operator = wd:Q16997658, "Merthyr Tydfil",
       IF(?operator = wd:Q6901162, "Monmouthshire",
       IF(?operator = wd:Q6984500, "Neath Port Talbot",   
       IF(?operator = wd:Q16998902, "Newport",
       IF(?operator = wd:Q7161994, "Pembrokeshire", 
       IF(?operator = wd:Q7236943, "Powys",   
       IF(?operator = wd:Q7321391, "Rhondda Cynon Taf",
       IF(?operator = wd:Q5123523, "Swansea",
       IF(?operator = wd:Q7825688, "Torfaen",   
       IF(?operator = wd:Q7909538, "Vale of Glamorgan", 
       IF(?operator = wd:Q8038115, "Wrexham",
                   "")))))))))))))))))))))) AS ?layer).  
  
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY ?operatorLabel ?sLabel

We cannot replace the variables (?operator), the IDs (Q7825688, P31), or the string literals ("Blaenau Gwent") with a code-side dictionary. That leaves only the literal SPARQL keywords (OPTIONAL) and some common constructs (SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }). That's not nothing, but it is much less than it might seem.
Also, it is still more effort than it might seem, because we have to be very careful in our work: we might not be able to fix some mistakes after launch. And it is very easy to make mistakes here, e.g. by replacing a keyword with a key that also appears in the actual content.
Writing your own compression algorithm isn't as bad as writing your own encryption algorithm, but similar arguments still apply.

So, if we were to go the route of compression, then I would recommend using a very stable implementation of one of the LZ algorithms from the '70s, so that we can hope it will still be around for a while. (See also: the Lindy effect.)

But at the end of the day, I think @Legoktm's initial suggestion is still the most practical one, with the best value for effort: just save your long query in a wiki page and share that.

We can probably make the process around that a bit nicer and with less friction:

  • a better error message when trying to shorten an overly long query, telling the user what to do instead
  • maybe a more versatile template that also optionally includes an embed link
  • maybe a help page that nicely explains how to do this

I don't see why that wouldn't resolve most of the issues without us having to hack together some tool that will keep being a nightmare to maintain down the road.

I've never really understood why URLs are the preferred mechanism for storing these queries. They're nice for a one-off, but make collaboration (what wikis excel at) incredibly difficult, because if you want to adjust or update a query, you have to create a brand new URL and redistribute it.

Is it just me or is that not even a goal of WDQS? Collaboration isn't even mentioned in the user story. Especially if we're talking about documenting queries, that seems like something you really want collaboration support for. We have a really good mechanism for storing text, distributing text, searching/discovering text, patrolling changes to text, and linking to text (wikis, in case it wasn't clear), so I just don't get why people want to do it all in URLs.

Finding some fancy compression algorithm to squeeze even more characters into fewer bytes is technically a solution, but I don't think it really makes the service any better.

We have an as-yet unused "Query:" namespace; perhaps it could be opened for storing queries longer than 2000 characters? (A redirection mechanism such that a visit to any page in that namespace takes you to query.wikidata.org might be useful, and there ought to be a system, similar to what is done with CSS pages, that ensures that a query is syntactically correct before it can be saved to that namespace.)

IIRC from the very early days when we deployed the query namespace, it was intended for storing queries that would generate lists, basically like ListeriaBot but built into the software. But at that time the original plan was to build WDQS as a MediaWiki extension rather than as a separate service. Namespaces are relatively cheap to add, so it would be cool to see the Query namespace used for something like this. And if the original plan does ever pan out, it could be added as a "List" namespace or something.

I've never really understood why URLs are the preferred mechanism for storing these queries. They're nice for a one-off, but make collaboration (what wikis excel at) incredibly difficult, because if you want to adjust or update a query, you have to create a brand new URL and redistribute it.

To be fair, the desired use case is probably Twitter and such:

timeline of coups d'état of the 21st century (including attempted ones): https://query.wikidata.org/embed.html#%23...

And the link is https://query.wikidata.org/embed.html#%23defaultView%3ATimeline%0ASELECT%20%3Fcoup%20%3FcoupLabel%20%28MIN%28%3Fdate%29%20AS%20%3Fdate_%29%20%28SAMPLE%28%3Fimage%29%20AS%20%3Fimage_%29%20WHERE%20%7B%0A%20%20%3Fcoup%20wdt%3AP31%2Fwdt%3AP279*%20wd%3AQ45382%3B%0A%20%20%20%20%20%20%20%20wdt%3AP585%20%3Fdate.%20hint%3APrior%20hint%3ArangeSafe%20true.%0A%20%20FILTER%28%3Fdate%20%3E%3D%20%222001-00-00%22%5E%5Exsd%3AdateTime%0A%20%20%20%20%20%20%20%20%26%26%20%3Fdate%20%3C%20%222101-00-00%22%5E%5Exsd%3AdateTime%29%0A%20%20OPTIONAL%20%7B%20%3Fcoup%20wdt%3AP18%20%3Fimage.%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22[AUTO_LANGUAGE]%2Cen%22.%20%7D%0A%7D%0AGROUP%20BY%20%3Fcoup%20%3FcoupLabel

So it isn't so much about sharing the query as it is about sharing the result with as little overhead for the reader as possible.
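For reference, such a link is just the percent-encoded query after the "#", so the query can be recovered with a single decode step. A quick Python check on the first few characters of the fragment above:

```python
from urllib.parse import unquote

# The beginning of the URL fragment from the embed link above.
fragment = "%23defaultView%3ATimeline%0ASELECT%20%3Fcoup"

# %23 -> "#", %3A -> ":", %0A -> newline, %20 -> space, %3F -> "?"
print(unquote(fragment))
```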

Thank you @Michael! I will now try to come up with a task for the next steps.

@Mahir256: No worries, I did not forget this!

Stuff mentioned on Telegram:

  • paste.toolforge.org
  • creating a new wiki page every time is probably too complicated compared to just sending someone a link