
[Epic] Support for queries on-wiki (automated list generation)
Open, High, Public

Description

From the development plan:
Users are able to write queries like “all poets who lived in 1982” or “all cities with more than 1 Million inhabitants”. They are entered in a page in the Query namespace and internally saved as JSON. They are then executed when resources are available - usually not immediately. The result is cached. A query can be set to rerun at regular intervals or on-demand by an administrator. The result of the query is shown on the same page. It can also be accessed via the API. The clients can include the result of a query in their pages to for example create list articles. This will enable for example to have automatically updated list articles on Wikipedia.

However it is not decided yet whether the Queries will reside on Wikidata.org or on wiki. It's part of the discussion we will have to define the needs better.

Details

Reference
bz65626

Related Objects

Event Timeline


Isn't this the very definition of the Wikidata-Query-Service, which is now deployed in production, therefore meaning this is resolved?

This is about being able to use the query service from a wiki via a Lua module.

Deskana renamed this task from [Epic] support for complex queries to [Epic] support for complex queries on-wiki using Lua.Dec 23 2015, 6:20 AM

This is about being able to use the query service from a wiki via a Lua module.

I've updated the task title to reflect this.

Lydia_Pintscher renamed this task from [Epic] support for complex queries on-wiki using Lua to [Epic] support for complex queries on-wiki using Lua (automated list generation).Mar 9 2016, 1:56 PM
iecetcwcpggwqpgciazwvzpfjpwomjxn renamed this task from [Epic] support for complex queries on-wiki using Lua (automated list generation) to [Epic] Support for queries on-wiki (automated list generation).Jun 1 2018, 1:25 PM

As one of the possible solutions for this: https://commons.wikimedia.org/wiki/User:TabulistBot

Not sure whether it closes the task, but at least it implements a major part of it.

@Lydia_Pintscher - I wonder if this may be enough?

I am noticing that there's no use of the TabulistBot at all. So I wonder if the use case is there...

Listeria is a 3rd party tool, and we need a feature built into Wikimedia. Also, Listeria does not store the result in any structured way.

Listeria is a 3rd party tool, and we need a feature built into Wikimedia.

What makes you think so?

Another issue to consider: every update by ListeriaBot makes an edit to each page, which clutters page histories for pages with embedded lists. It is much worse if a wiki hosts thousands of Listeria-generated pages, many of which are barely viewed but refreshed frequently, as on cywiki. Snapshots are sometimes useful (T291091: Snapshots for saved queries), but most of the time they are not.

Frequently used queries fall into several types: those that should be refreshed regularly (ListeriaBot currently does this, but many pages are not refreshed as frequently as the page specifies); those that may be refreshed on demand; and those that are relatively stable, such as a list of chemical elements or of US presidents. For the last type, it is useful to introduce a semi-permanent result, where users who want to refresh it can see the diff between result sets. This also keeps out most vandalism, and may even become a way to monitor vandalism (with some mechanism to regularly check query results). All three types are stored queries.
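The three types described above could be modeled as refresh policies, with a result-set diff supporting the semi-permanent case. This is a minimal sketch under assumed names; none of these identifiers exist in any real extension.

```python
from enum import Enum

class RefreshPolicy(Enum):
    # Hypothetical policies matching the three types described above.
    REGULAR = "regular"                 # rerun on a fixed schedule
    ON_DEMAND = "on-demand"             # rerun when a user or admin asks
    SEMI_PERMANENT = "semi-permanent"   # keep the result until explicitly refreshed

def result_diff(old_rows, new_rows):
    """Diff two result sets, e.g. to review a semi-permanent refresh."""
    old, new = set(old_rows), set(new_rows)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

diff = result_diff(["Q42", "Q64"], ["Q64", "Q90"])
assert diff == {"added": ["Q90"], "removed": ["Q42"]}
```

Exposing such a diff would also serve the vandalism-monitoring idea: a reviewer approves the diff before the cached result is replaced.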

Also, there could be a parser function to invoke an arbitrary query on the fly, probably via some caching mechanism. This may be part of Wikilambda. In addition, there could be a "query template" mechanism to handle sets of queries whose only difference is some parameters (one example: people who died on <date>).
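A query template as described could be as simple as stored SPARQL text with named parameters substituted at call time. The sketch below is hypothetical; it uses the real Wikidata property P570 (date of death) for the "people who died on <date>" example.

```python
from string import Template

# Hypothetical stored "query template": SPARQL text with a named $date
# parameter. P570 is Wikidata's "date of death" property.
DIED_ON = Template(
    "SELECT ?person WHERE { ?person wdt:P570 ?d . "
    'FILTER(?d = "$date"^^xsd:dateTime) }'
)

def instantiate(template, **params):
    # Fill the template's parameters to obtain a runnable query.
    return template.substitute(**params)

q = instantiate(DIED_ON, date="1982-03-02T00:00:00Z")
assert '"1982-03-02T00:00:00Z"' in q
```

A real implementation would also need to escape or validate parameter values to prevent query injection.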

Listeria is a 3rd party tool, and we need a feature built-in Wikimedia. Also Listeria does not store the result in any structured way.

I want to make one point more deeply: the current ListeriaBot design is poor, because it does multiple things at once.

First, it runs a SPARQL query, producing a (raw) result set. Second, Listeria performs some basic transforms, both automatic and manual: you can look up labels, descriptions, aliases, and some statements/qualifiers (this does not rely on SPARQL), and since 2.0 values are automatically formatted (i.e. turned into links), which, it seems, no parameter can disable. Third, the results are formatted into a table or template calls.

Ideally we should make these three independent parts: the query part, whose result may be cached in the ways described above; the transform part, whose result may also be cached, possibly per language if labels in the interface language are requested (if no terms are requested, or only monolingual terms, which can already be queried in SPARQL, this step can be skipped entirely); and the display part, which renders the content using the cached result.
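The three-stage separation described above can be sketched as three functions with the transform stage cached per interface language. Everything here is a stub under assumed names; a real `run_query` would call a query service such as WDQS, and a real transform would fetch labels from Wikibase.

```python
# Stage 1: query. Stub standing in for a SPARQL endpoint call.
def run_query(sparql):
    return [{"item": "Q42"}, {"item": "Q64"}]

# Stage 2: transform. Cached per interface language, as suggested above.
_transform_cache = {}

def transform(rows, lang):
    if lang not in _transform_cache:
        # Placeholder "label lookup"; a real one would query Wikibase terms.
        _transform_cache[lang] = [
            dict(r, label=f"{r['item']} ({lang})") for r in rows
        ]
    return _transform_cache[lang]

# Stage 3: display. Builds wikitext from the cached, transformed rows.
def render(rows):
    return "\n".join(f"* [[{r['label']}]]" for r in rows)

wikitext = render(transform(run_query("SELECT ..."), "de"))
assert "Q42 (de)" in wikitext
```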

(If there are multiple possible transforms of one query, a further optimization may be to cache query results under some sort of hash, or to make transforms a top-level object, though the latter may be confusing to some users.)
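The hash-keyed cache mentioned above might look like the following sketch: raw results are keyed by a hash of the normalized query text, so several transforms of the same query share one result set. The normalization step (collapsing whitespace) is an illustrative assumption.

```python
import hashlib

_result_cache = {}

def query_key(sparql):
    # Normalize whitespace before hashing so trivially different spellings
    # of the same query share one cache entry (an assumed normalization).
    normalized = " ".join(sparql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cached_run(sparql, runner):
    key = query_key(sparql)
    if key not in _result_cache:
        _result_cache[key] = runner(sparql)
    return _result_cache[key]

calls = []
def fake_runner(q):
    calls.append(q)
    return [{"item": "Q1"}]

cached_run("SELECT ?x  WHERE { }", fake_runner)
cached_run("SELECT ?x WHERE { }", fake_runner)  # same query, different spacing
assert len(calls) == 1  # the second call was served from the cache
```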

They are entered in a page in the Query namespace and internally saved as JSON.

Why JSON? We have WDQS, which uses SPARQL, so it seems straightforward for me to store these on-wiki queries in SPARQL as well. (The cached results should probably be JSON, as it’s easier to consume and format for display, but for the queries themselves we don’t need to reinvent the wheel by creating a JSON schema.)

Another feature ListeriaBot is missing: it always assumes that each entry is an item. Thus it is not easily usable for lists of properties, lexemes, or non-entities (such as dates). Hacks exist, but they are merely hacks, not a proper solution.

If we want to support this on third-party wikis, a solution that is not tightly coupled to Wikifunctions is recommended. This could be:

  • A feature to store specific queries in a query namespace, with cached results in some storage outside of WikiLambda, plus a feature to run arbitrary queries on the fly via Wikilambda, and/or
  • Extracting the async rendering feature of Wikilambda into some other extension, with the complex query service based on that extension

Update 2025-09-25: This would be important in Wikibase Cloud.

In https://wikitech.wikimedia.org/wiki/User:DCausse/WDQS_Graph_Split_Impact_Analysis it is mentioned that Listeria is heavily throttled by WDQS (only a 1.33% success rate). It is therefore not a scalable approach.

Despite the explanations by Bugreporter, I remain a bit skeptical as to whether something matching what's outlined here is feasible, or a better approach than improvements to and around Listeria.

For example, re "every update by ListeriaBot makes an edit to each page, which clutters page histories": that may be, but it's also a feature, because it enables users to watch changes and, for example, spot vandalism. Re "Listeria is heavily throttled by WDQS": isn't there a migration to QLever planned that improves performance? Maybe it could be unthrottled.

By the way, an issue with large lists is that they're difficult to maintain and keep up to date (or as comprehensive) on smaller Wikipedias; e.g. users don't notice when a page, once translated, gets changed and new entries are added to a list, and it's a lot of duplicated work. However, it's not as simple as it sounds on paper with this list generation either. For example, some structuring is done separately outside queries, and some properties and/or items (even when present on Wikipedia, with a category) are not in Wikidata, and things like that. One list where I thought of this feature was List of Linux distributions (compare it to the dynamic Listeria list using Wikidata).

The vision outlined here may sound great on paper, but I wonder how it could be implemented; maybe this could be described as a wish in the ongoing Community Wishlist. There I made this proposal to improve Listeria (voting open) to fix various issues with it, maybe even creating a fork of it, etc. (please comment on the talk page if you have input on it). Listeria is up and running. It works. What's described here is so far a mere rough vision that may not be feasible, or, if it is, could perhaps be implemented via something like a Listeria 2.0. With the latest tools it has become easier to create new projects from an existing open source project, in other words to transform it, e.g. regarding "we should make them three independent parts".
Beyond this, it seems unclear how people here think this could be implemented; again, it would be nice if someone could elucidate how it could be realized. What's proposed here does sound great.

One of the fundamental problems with Listeria is that it is based on the current page structure, so the result cannot be reused on other wikis. And the result is not machine-readable. We have a solution for both issues, though (which needs implementation): let ListeriaBot write to the Commons Data namespace.
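Writing to the Commons Data namespace would mean emitting tabular-data pages (Data:….tab), whose body is JSON with `schema` and `data` keys and which must be licensed CC0-1.0. The sketch below builds such a page body from query results; the exact field layout should be verified against the live Commons format, and the rows use the real items Q64 (Berlin) and Q1055 (Hamburg) only as sample data.

```python
import json

# Sample query results: item ID, label, population.
rows = [["Q64", "Berlin", 3769000], ["Q1055", "Hamburg", 1841000]]

# Body of a hypothetical Data:Cities_over_1M.tab page, following the Commons
# tabular-data format as I understand it (verify before relying on this).
data_page = {
    "license": "CC0-1.0",  # tabular data on Commons must be CC0
    "description": {"en": "Cities with more than 1 million inhabitants"},
    "schema": {
        "fields": [
            {"name": "item", "type": "string"},
            {"name": "label", "type": "string"},
            {"name": "population", "type": "number"},
        ]
    },
    "data": rows,
}

body = json.dumps(data_page, ensure_ascii=False)
assert json.loads(body)["schema"]["fields"][2]["name"] == "population"
```

Because the result would then live in one machine-readable page, any wiki could render it with its own formatting, addressing both reuse and machine-readability.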

And it currently does not support lexemes. There is another tool for that: https://talikak.toolforge.org/

users don't notice when a page, once translated, gets changed and new entries are added to a list, and it's a lot of duplicated work. However, it's not as simple as it sounds on paper with this list generation either. For example, some structuring is done separately outside queries, and some properties and/or items (even when present on Wikipedia, with a category) are not in Wikidata, and things like that.

For such use case my proposed solution is T332484: JSON-based page list.

Listeria is up and running. It works.

But a built-in feature (one that does not depend on Wikifunctions either) would help solve T339864: Enable the use of Listeria on wikibase.cloud (or provide some other way to render page lists).

isn't there some migration to QLever planned that improves performance? Maybe it could be unthrottled.

Many of the use cases of WDQS can be replaced with GraphQL, though not all. Therefore the proposed automated list generation feature needs to support dual backends.
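A dual-backend requirement like the one above usually means hiding the query service behind an abstraction, with one implementation per service. This is a sketch under assumed names; both backends here are stubs, where real ones would call WDQS (SPARQL) and a GraphQL endpoint respectively.

```python
from abc import ABC, abstractmethod

class QueryBackend(ABC):
    # Abstract interface the list-generation feature would program against.
    @abstractmethod
    def run(self, query: str) -> list:
        ...

class SparqlBackend(QueryBackend):
    def run(self, query):
        return [{"backend": "sparql"}]   # stub for a WDQS call

class GraphQLBackend(QueryBackend):
    def run(self, query):
        return [{"backend": "graphql"}]  # stub for a GraphQL call

def make_backend(name: str) -> QueryBackend:
    # Pick a backend by name, e.g. from a stored query's metadata.
    return {"sparql": SparqlBackend, "graphql": GraphQLBackend}[name]()

result = make_backend("graphql").run("{ items }")
assert result[0]["backend"] == "graphql"
```

Stored queries would then record which backend (and query language) they target, so each can be routed and cached appropriately.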