
RFC: Store WikibaseQualityConstraint check data in persistent storage
Open, Normal, Public

Description

This RFC is a result of T204024 and specifically the request for an RFC in T204024#4891344

Vocabulary

  • WBQC: WikibaseQualityConstraints mediawiki extension, deployed on wikidata.org.
  • WDQS: The Wikidata Query Service, https://query.wikidata.org.

Current situation

WBQC runs checks on Wikidata entities on demand from users.
Results of these constraint checks are stored in memcached with a default TTL of 86400 seconds (1 day).

WBQC checks are accessible via three methods: a special page, an API module, and an RDF page action.

The special page and API can be used by users directly; the API is also called whenever a logged-in user visits an entity page, to display the results on the entity page.

Executions of the API will result in constraint checks being run if the stored data is out of date, or if the cache entry for the entity is absent or expired.

Executions of the special page currently always re-run the constraint checks, and do not read from or write to the cache.

The RDF page action exists for use by the WDQS and will not run the constraint checks itself; it only exposes an RDF description of the currently stored constraint check results that apply to the entity.

When retrieved from the cache, the WBQC extension has logic built-in to determine if the stored result needs to be updated (because something in the dependency graph has changed).
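The staleness logic described above could be sketched roughly as follows. This is a Python sketch with hypothetical names and a simplified data shape, not the extension's actual (PHP) implementation; the real extension tracks dependencies with its own metadata format.

```python
# Hypothetical cached result: check output plus dependency metadata
# recorded at generation time (illustrative shape only).
cached = {
    "results": {"P31": "compliance"},
    "generated_at": 1_000_000,                # unix timestamp of generation
    "depends_on": {"Q42": 5, "P31": 12},      # entity id -> revision seen
}

def is_stale(cached, latest_revisions, max_age, now):
    """A cached result is considered stale if any entity in its
    dependency graph has a newer revision than the one seen when the
    result was generated, or if the result is older than max_age."""
    if now - cached["generated_at"] > max_age:
        return True
    return any(
        latest_revisions.get(entity, 0) > seen_rev
        for entity, seen_rev in cached["depends_on"].items()
    )
```

A fresh result with unchanged dependencies would be served as-is; any newer revision in the dependency graph forces a re-check.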

We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user. T204031

Once constraint checks are stored more persistently, we will be able to expose an event stream of check generation for ingestion into the WDQS (T201147).
Loading/re-loading of data into the WDQS will also require the ability to dump all constraint checks.

5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check execution.

Roughly 1.85 million items currently have no statements, leaving about 52 million items that do have statements and need constraint checks run.

Constraint checks also run on Properties and Lexemes but the number there is negligible when compared with Items.

Constraint checks on an item can take a wide variety of times to execute based on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING) and the performance of all constraint checks is monitored on grafana.
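The slow-check logging described above could be sketched like this (Python sketch with hypothetical function names; the thresholds of 5 s for INFO and 55 s for WARNING come from the text, everything else is illustrative):

```python
import logging
import time

logger = logging.getLogger("wbqc")

# Thresholds from the text: log full constraint checks at INFO above
# 5 seconds and at WARNING above 55 seconds.
INFO_THRESHOLD = 5.0
WARNING_THRESHOLD = 55.0

def run_timed_check(entity_id, check_fn):
    """Run a constraint check and log it if it was slow."""
    start = time.monotonic()
    result = check_fn(entity_id)
    elapsed = time.monotonic() - start
    if elapsed > WARNING_THRESHOLD:
        logger.warning("constraint check for %s took %.1fs", entity_id, elapsed)
    elif elapsed > INFO_THRESHOLD:
        logger.info("constraint check for %s took %.1fs", entity_id, elapsed)
    return result
```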

Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.

Problem statement

Primary problem statement:

  • Constraint check results need to be loaded into WDQS, but we don't currently have the result of all constraints checks for all Wikidata items stored anywhere.

Secondary problem statements:

  • Generating constraint reports when the user requests them leads to a bad user experience as they must wait for a prolonged amount of time.
  • Users can flood the API generating constraint checks for entities, putting unnecessary load on the app servers.

Proposal 1

  • Rather than defaulting to running constraint checks on a user's request, primarily pre-generate constraint check results post-edit using the job queue. T204031
  • Rather than storing constraint check results in memcached, store them in a more permanent storage solution.
  • When new constraint check results are stored, fire an event for the WDQS to listen to so that it can load the new constraint check data.
  • Dump constraint check data from the persistent storage to allow for dumping to file and loading into WDQS.
  • Use the same logic that currently exists to determine whether the stored constraint check data needs updating when retrieved.
  • Alterations to the special page to load from the cache? Provide the timestamp of when the checks were run? Provide a way to manually purge the checks and re-run them (get the latest results) with a button on the page.

Note: Even when constraint checks are run after all entity edits, the data persistently stored will slowly become out of date (therefore also the data stored by WDQS). The issue of 1 edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.
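The proposed flow could be sketched as follows. This is Python pseudocode with hypothetical names (in MediaWiki this would be a PHP JobQueue job); the in-memory dict and list stand in for the persistent store and the event stream.

```python
# Sketch of the proposed post-edit flow: a job re-runs the checks,
# persists the result, and emits an event for WDQS to consume.
store = {}      # stands in for the persistent storage
events = []     # stands in for the event stream consumed by WDQS

def run_constraint_checks(entity_id):
    # Placeholder for the actual (expensive) constraint check run.
    return {"entity": entity_id, "results": []}

def enqueue_check_job(entity_id):
    # In MediaWiki this would push a job onto the JobQueue;
    # here we run it inline for illustration.
    result = run_constraint_checks(entity_id)
    store[entity_id] = result
    events.append({"type": "constraint-check-complete", "entity": entity_id})

def on_edit(entity_id):
    # Post-edit hook: schedule regeneration instead of waiting for a user.
    enqueue_check_job(entity_id)
```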

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 22 2019, 11:27 AM
daniel added a subscriber: daniel. (Edited) Jan 23 2019, 9:29 PM

This is the same kind of storage need that is currently causing problems here: T210548: gzip-encoded page properties can't be exported from the API. There are a lot of parallels to T119043: Graph/Graphoid/Kartographer - data storage architecture.

Joe added a subscriber: Joe. Feb 14 2019, 3:14 PM

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?
  • How will we access such constraints? Always by key and/or full dump, or other access patterns can become useful/interesting in the future?
  • Given those values will only be updated via the job queue, we don't need active-active write capabilities in the storage; or do you still want to be able to compute the check on demand, in which case a/a storage is recommendable?

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

daniel moved this task from Inbox to Backlog on the TechCom-RFC board. Feb 19 2019, 1:52 PM

Moving to backlog, since we are waiting for feedback from the RFC's author.

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?

Currently there is only the need to store the latest constraint check data for an item.

  • How will we access such constraints? Always by key and/or full dump, or other access patterns can become useful/interesting in the future?

There are currently no other access patterns on the horizon.
(storing the data like this will allow us to load it into the WDQS and query it from there)

  • Given those values will only be updated via the job queue, we don't need active-active write capabilities in the storage; or do you still want to be able to compute the check on demand, in which case a/a storage is recommendable?

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or if the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too. Once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request).
But that is not the immediate plan.

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

The main regular updates for the WDQS will come via events in kafka, and then the WDQS retrieving the constraint check data from an MW API in the same way that it retrieves changes to entities.
The "full dump" is needed when starting a WDQS server off from a fresh start.
The full dump could just be a PHP script that iterates through the storage for all entities and slowly dumps the data to disk (similar to our dumpJson or dumpRdf scripts, and similar to a regular MW dump)

Addshore moved this task from Backlog to Inbox on the TechCom-RFC board. Feb 20 2019, 9:37 AM

Moving back to Inbox as the feedback is now provided and this is again in the court of TechCom.

Krinkle updated the task description. Feb 20 2019, 9:43 PM
Joe added a comment. Feb 21 2019, 6:25 AM

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or if the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too. Once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request).
But that is not the immediate plan.

My point here is quite subtle but fundamental: if we can split reads and writes to this datastore based on the HTTP verb, so that constraints would be persisted only via either
a - a specific job that was enqueued (by a user request or by an edit), or
b - a POST request,
it would be possible to store these data in the cheapest k-v storage we have, the ParserCache. That would typically be cheaper and faster than using a distributed k-v storage like Cassandra, which I'd reserve for things that need to be written to from multiple datacenters.

I think we can easily have this only persist to the store via the Job or a POST.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.
Once the storage is fully populated this isn't even a case we need to think about.
Purges of the stored data would then happen via a post (similar interface to page purges).
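The read/write split discussed in this exchange could look roughly like this (a Python sketch with hypothetical handler names; real handlers would be MediaWiki API modules in PHP):

```python
# GET may compute results on the fly but never persists them;
# only POST (or a queued job) writes to the storage.
store = {}  # stands in for the persistent k-v storage

def run_constraint_checks(entity_id):
    # Placeholder for the actual constraint check run.
    return {"entity": entity_id, "results": []}

def handle_get(entity_id):
    if entity_id in store:
        return store[entity_id]
    # Cache miss: compute for this response, but do not write to storage.
    return run_constraint_checks(entity_id)

def handle_post(entity_id):
    # Purge/regenerate: the only request path that persists results.
    result = run_constraint_checks(entity_id)
    store[entity_id] = result
    return result
```

Because only POST and job-queue code paths write, the storage never needs to accept writes from the secondary datacenter serving GET traffic.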

Addshore triaged this task as Normal priority. Feb 21 2019, 8:46 AM

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

Yes we could still trigger a job on the GET :)
Probably cleaner to just have it run again; this won't be a high-traffic use case and will slowly vanish, so no need to worry about the duplicated effort / wasted cycles.

and once we have data for all items persistently stored in theory the user would never ask for an items constraint check and it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

and once we have data for all items persistently stored in theory the user would never ask for an items constraint check and it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

That is a good point.
We could consider having different API behaviour for GET vs POST, with POST being the only one that gets fresh results.

We could also do a variety of other things, for example:

  • re-run the constraint checks in the web request and then present them to the user
  • send an error to the client saying the results are outdated and to retry later, while running the checks in the job queue
  • serve the outdated results but trigger a job to update the stored results
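The third option above (serve stale, refresh in the background) could be sketched like this (Python sketch, hypothetical names; the real refresh would go through the MediaWiki JobQueue):

```python
# Serve whatever is stored, but if it is marked stale, enqueue a
# background job to regenerate it rather than blocking the request.
store = {"Q1": {"results": [], "stale": True}}
job_queue = []  # stands in for the MediaWiki JobQueue

def handle_get(entity_id):
    cached = store.get(entity_id)
    if cached and cached.get("stale"):
        job_queue.append(("refresh-checks", entity_id))  # background update
    return cached
```

The user gets an immediate (possibly outdated) answer, and the store converges once the job runs.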

Perhaps we need to figure out exactly how we want this to appear to the user @Lydia_Pintscher so that we can figure out what we want to do in the background.

kchapman moved this task from Inbox to Under discussion on the TechCom-RFC board. Feb 28 2019, 6:32 AM

Just to be clear on this RFC regarding my above comment, we are not waiting for a reply from @Lydia_Pintscher here. The decision is to keep the behaviour for the user the same. This still allows us to only need to write to storage during POSTs and via the job queue.

Krinkle updated the task description. Mar 20 2019, 7:17 PM
abian added a subscriber: abian. Apr 9 2019, 8:43 PM

Just poking this a few months down the line; as far as I know this still rests with TechCom for a decision, if discussion has finished?

daniel moved this task from Under discussion to Inbox on the TechCom-RFC board. Fri, Jun 14, 7:35 PM

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

Yup, it seems like it is ready.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

Gotcha