- WBQC: WikibaseQualityConstraints MediaWiki extension, deployed on wikidata.org.
- WDQS: The Wikidata Query Service, https://query.wikidata.org.
WBQC runs checks on Wikidata entities on demand from users.
Results of these constraint checks are stored in memcached with a default TTL of 86400 seconds (1 day).
WBQC checks are accessible via 3 methods:
- RDF action https://www.wikidata.org/wiki/Q123?action=constraintsrdf
- Special page https://www.wikidata.org/wiki/Special:ConstraintReport/Q123
- API https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=Q123
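As a small illustration, the three access URLs listed above can be built for any entity ID. This is only a sketch: the helper name `constraint_urls` is hypothetical, and the `format=json` parameter is an assumption added for the API URL, not something the list above specifies.

```python
from typing import Dict
from urllib.parse import urlencode

WIKIDATA = "https://www.wikidata.org"


def constraint_urls(entity_id: str) -> Dict[str, str]:
    """Build the RDF action, special page and API URLs for one entity."""
    api_query = urlencode({
        "action": "wbcheckconstraints",
        "id": entity_id,
        "format": "json",  # assumption: ask the API for JSON output
    })
    return {
        "rdf": f"{WIKIDATA}/wiki/{entity_id}?action=constraintsrdf",
        "special": f"{WIKIDATA}/wiki/Special:ConstraintReport/{entity_id}",
        "api": f"{WIKIDATA}/w/api.php?{api_query}",
    }


urls = constraint_urls("Q123")
```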
The special page and API can be used by users directly; the API is also called whenever a logged-in user visits an entity page, to display the results on the entity page.
An API request runs the constraint checks only if the cached result for the entity is absent, expired, or out of date.
A special page request currently always re-runs the constraint checks and neither reads from nor writes to the cache.
The RDF page action exists for use by WDQS and will not run the constraint checks itself; it only exposes an RDF description of the currently stored constraints that apply to this entity.
When a result is retrieved from the cache, WBQC has built-in logic to determine whether the stored result needs to be updated (because something in the dependency graph has changed).
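The API behaviour described above (use the cached result unless it is missing, expired, or stale) can be sketched roughly as follows. This is not the WBQC implementation: the names `CachedResult`, `is_fresh` and `get_results` are hypothetical, a plain dict stands in for memcached, and staleness is modelled as "a dependency's revision ID changed", which is only one assumed reading of "something in the dependency graph has changed".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # one-day TTL, as for the memcached entries


@dataclass
class CachedResult:
    results: List[str]
    stored_at: float
    # Assumption: entity ID -> revision ID observed when the checks ran.
    dependency_revisions: Dict[str, int]


def is_fresh(entry: Optional[CachedResult], now: float,
             latest_revision: Callable[[str], Optional[int]]) -> bool:
    """True if the entry exists, has not expired, and no dependency changed."""
    if entry is None:
        return False
    if now - entry.stored_at >= TTL_SECONDS:
        return False
    return all(latest_revision(dep) == rev
               for dep, rev in entry.dependency_revisions.items())


def get_results(entity_id: str, cache: Dict[str, CachedResult], now: float,
                latest_revision: Callable[[str], Optional[int]],
                run_checks: Callable[[str], Tuple[List[str], Dict[str, int]]]
                ) -> List[str]:
    """Return cached results when fresh, otherwise re-run and store them."""
    entry = cache.get(entity_id)
    if is_fresh(entry, now, latest_revision):
        return entry.results
    results, dependencies = run_checks(entity_id)
    cache[entity_id] = CachedResult(results, now, dependencies)
    return results
```

In this sketch the special page would call `run_checks` directly and skip the cache entirely, while the RDF action would only read `cache.get(entity_id)` and never trigger a run.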
We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user (T204031).
Once constraint checks are stored more persistently, we will be able to expose an event queue of the generation of the checks for ingestion into WDQS (T201147).
Loading or re-loading data into WDQS will also require a dump of all constraint check results.
5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check execution.
Roughly 1.85 million items do not have any statements (currently), leaving 52 million items that do have statements and need to have constraint checks run.
Constraint checks also run on Properties and Lexemes but the number there is negligible when compared with Items.
The execution time of the constraint checks on an item varies widely depending on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING), and the performance of all constraint checks is monitored on Grafana.
Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.
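The logging thresholds mentioned above could be expressed along these lines. This is an illustrative sketch using Python's standard `logging` module, not the extension's code; the logger name and the helpers `log_level_for` and `timed_check` are hypothetical.

```python
import logging
import time
from typing import Callable, List, Optional

logger = logging.getLogger("constraint_checks")  # hypothetical logger name

INFO_THRESHOLD_SECONDS = 5.0
WARNING_THRESHOLD_SECONDS = 55.0


def log_level_for(duration_seconds: float) -> Optional[int]:
    """Map a full-check duration to a log level, or None if not logged."""
    if duration_seconds > WARNING_THRESHOLD_SECONDS:
        return logging.WARNING
    if duration_seconds > INFO_THRESHOLD_SECONDS:
        return logging.INFO
    return None


def timed_check(entity_id: str,
                run_checks: Callable[[str], List[str]]) -> List[str]:
    """Run the checks and log the duration if it crosses a threshold."""
    start = time.monotonic()
    results = run_checks(entity_id)
    duration = time.monotonic() - start
    level = log_level_for(duration)
    if level is not None:
        logger.log(level, "Full constraint check for %s took %.1f s",
                   entity_id, duration)
    return results
```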
Primary problem statement:
- Constraint check results need to be loaded into WDQS, but we don't currently have the results of all constraint checks for all Wikidata items stored anywhere.
Secondary problem statements:
- Generating constraint reports when the user requests them leads to a bad user experience, because users may have to wait a long time for the result.
- Users can flood the API with requests that generate constraint checks for entities, putting unnecessary load on app servers.
Proposed solution:
- Rather than running constraint checks primarily upon a user's request, pre-generate constraint check results post-edit using the job queue (T204031).
- Rather than storing constraint check results in memcached, store them in a more permanent storage solution.
- When new constraint check results are stored, fire an event for WDQS to listen to so that it can load the new constraint check data.
- Dump constraint check data from the persistent storage to file so that it can be loaded into WDQS.
- Use the same logic that currently exists to determine if the stored constraint check data needs updating when retrieved.
- Possible alterations to the special page: load results from the cache? Provide the timestamp of when the checks were run? Provide a button on the page to manually purge the checks and re-run them (to get the latest results)?
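The flow in the bullets above (run checks after an edit, store the results persistently, emit an event, and allow a full dump) might look roughly like this. Everything here is a stand-in chosen for illustration: `PersistentStore` is an in-memory dict rather than a real storage backend, the event is a plain dict appended to a list, and the JSON-lines dump format is an assumption, not a decision made by this RFC.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional


class PersistentStore:
    """Hypothetical stand-in for the more permanent storage solution."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = {}

    def save(self, entity_id: str, results: List[str]) -> None:
        self._rows[entity_id] = results

    def load(self, entity_id: str) -> Optional[List[str]]:
        return self._rows.get(entity_id)

    def dump_lines(self) -> Iterator[str]:
        """Yield one JSON line per entity, e.g. for writing to a dump file."""
        for entity_id in sorted(self._rows):
            yield json.dumps(
                {"entity": entity_id, "results": self._rows[entity_id]},
                sort_keys=True,
            )


def post_edit_job(entity_id: str,
                  run_checks: Callable[[str], List[str]],
                  store: PersistentStore,
                  events: List[dict]) -> List[str]:
    """Job body: run checks, persist the results, then announce them."""
    results = run_checks(entity_id)
    store.save(entity_id, results)
    # Assumed event shape; a consumer such as WDQS would listen for these.
    events.append({"type": "constraint-check-stored", "entity": entity_id})
    return results
```

Storing before emitting the event means a listener that reacts to the event can always read the results it was told about.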
Note: Even when constraint checks are run after all entity edits, the persistently stored data will slowly become out of date (and therefore so will the data stored by WDQS). The issue of one edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.