
RFC: Store WikibaseQualityConstraint check data in persistent storage
Open, Normal, Public

Description

This RFC is a result of T204024 and specifically the request for an RFC in T204024#4891344

Vocabulary

  • WBQC: WikibaseQualityConstraints mediawiki extension, deployed on wikidata.org.
  • WDQS: The Wikidata Query Service, https://query.wikidata.org.

Current situation

WBQC runs checks on Wikidata entities on demand from users.
Results of these constraint checks are stored in memcached with a default TTL of 86400 seconds (1 day).

WBQC checks are accessible via three methods: a special page, an API module, and an RDF page action.

The special page and API can be used by users directly; the API is also called whenever a logged-in user visits an entity page, to display the results on the entity page.

Executions of the API result in constraint checks being run if the stored data is out of date, or if the cache entry for the entity is absent or expired.
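
For illustration, here is a minimal sketch of that get-or-regenerate flow using MediaWiki's WANObjectCache helper; the key layout and the runConstraintChecksFor() helper are assumptions for illustration, not the extension's actual code.
```php
<?php
// Sketch of "use cached results, else run the checks now" (illustrative only).
use MediaWiki\MediaWikiServices;

function getCheckResultsForEntity( string $entityId ): array {
    $cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

    return $cache->getWithSetCallback(
        // One cache entry per entity; this key layout is an assumption.
        $cache->makeKey( 'WikibaseQualityConstraints', 'check-results', $entityId ),
        86400, // one-day TTL, matching the default described above
        function ( $oldValue, &$ttl, array &$setOpts ) use ( $entityId ) {
            // Cache miss or expired entry: run the (potentially slow) checks now.
            return runConstraintChecksFor( $entityId ); // hypothetical helper
        }
    );
}
```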

Executions of the special page currently always re-run the constraint checks; they neither read from nor write to the cache.

The RDF page action exists for use by the WDQS and does not run the constraint checks itself; it only exposes an RDF description of the constraint check results currently stored for the entity.

When a result is retrieved from the cache, the WBQC extension has built-in logic to determine whether the stored result needs to be updated (because something in its dependency graph has changed).
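
A minimal sketch of what such a staleness check could look like, assuming each stored result records the revisions of the entities it depended on and an optional "valid until" timestamp; the array shape and the getLatestRevisionId() helper are illustrative assumptions, not the extension's actual code.
```php
<?php
// Illustrative staleness check over a stored constraint check result.
function isStale( array $storedResult ): bool {
    foreach ( $storedResult['dependedOnRevisions'] as $entityId => $revisionAtCheckTime ) {
        if ( getLatestRevisionId( $entityId ) > $revisionAtCheckTime ) { // hypothetical lookup
            // Something in the dependency graph was edited since the check ran.
            return true;
        }
    }
    if ( isset( $storedResult['validUntil'] ) && $storedResult['validUntil'] < time() ) {
        // Time-based constraints may now evaluate differently.
        return true;
    }
    return false;
}
```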

We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user (T204031).
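
A rough sketch of what such a post-edit job could look like; the class name, job name, and the runConstraintChecksFor()/storeCheckResults() helpers are hypothetical, not the actual implementation of T204031.
```php
<?php
// Illustrative post-edit job that re-runs checks and persists the results.
class CheckConstraintsJob extends Job {
    public function __construct( Title $title, array $params ) {
        parent::__construct( 'checkConstraints', $title, $params );
        // Let the job queue collapse duplicate jobs for the same entity.
        $this->removeDuplicates = true;
    }

    public function run() {
        $entityId = $this->params['entityId'];
        $results = runConstraintChecksFor( $entityId ); // hypothetical helper
        storeCheckResults( $entityId, $results );       // hypothetical helper
        return true;
    }
}

// Enqueued from a post-edit hook, roughly like:
// JobQueueGroup::singleton()->push( new CheckConstraintsJob( $title, [ 'entityId' => $entityId ] ) );
```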

Once constraint check results are stored more persistently, we will be able to expose an event queue of newly generated checks for ingestion into the WDQS (T201147).
Loading or re-loading data into the WDQS will also require the ability to dump all constraint check results.

5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check execution.

Roughly 1.85 million items do not have any statements (currently), leaving 52 million items that do have statements and need to have constraint checks run.

Constraint checks also run on Properties and Lexemes but the number there is negligible when compared with Items.

Constraint checks on an item can take widely varying amounts of time to execute, depending on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING), and the performance of all constraint checks is monitored in Grafana.

Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.

Problem statement

Primary problem statement:

  • Constraint check results need to be loaded into WDQS, but we don't currently have the results of all constraint checks for all Wikidata items stored anywhere.

Secondary problem statements:

  • Generating constraint reports only when the user requests them leads to a bad user experience, as users must wait for a prolonged amount of time.
  • Users can flood the API with requests that generate constraint checks for entities, putting unnecessary load on the app servers.

Proposal 1

  • Rather than defaulting to running constraint checks on a user's request, primarily pre-generate constraint check results post-edit using the job queue (T204031).
  • Rather than storing constraint check results in memcached, store them in a more permanent storage solution.
  • When new constraint check results are stored, fire an event for the WDQS to listen for so that it can load the new constraint check data (see the event sketch after this list).
  • Allow dumping constraint check data from the persistent storage to file, for loading into WDQS.
  • Use the same logic that currently exists to determine whether the stored constraint check data needs updating when it is retrieved.
  • Possibly alter the special page to load from the cache, show the timestamp of when the checks were run, and provide a button to manually purge the checks and re-run them (to get the latest results).
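
As referenced in the list above, here is a rough sketch of the kind of event the storage step could emit for the WDQS; the stream name, payload fields, and the emitCheckEvent() transport are all assumptions for illustration, since no schema has been agreed yet.
```php
<?php
// Illustrative event emitted after new check results are persisted.
function notifyWdqsOfNewResults( string $entityId, int $revisionId ): void {
    $event = [
        'meta' => [
            'stream' => 'wikibase.constraint-check-complete', // hypothetical stream name
            'dt' => wfTimestamp( TS_ISO_8601 ),
        ],
        'entity_id' => $entityId,
        'revision_id' => $revisionId,
        // The WDQS updater would fetch the RDF for the results on receipt,
        // in the same way it fetches entity changes.
    ];
    emitCheckEvent( $event ); // hypothetical transport (e.g. EventBus / Kafka)
}
```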

Note: Even when constraint checks are run after all entity edits, the persistently stored data will slowly become out of date (and therefore also the data stored by the WDQS). The issue of one edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.

Event Timeline

daniel added a subscriber: daniel. Edited Jan 23 2019, 9:29 PM

This is the same kind of storage need that is currently causing problems here: T210548: gzip-encoded page properties can't be exported from the API. There are a lot of parallels to T119043: Graph/Graphoid/Kartographer - data storage architecture.

Joe added a subscriber: Joe. Feb 14 2019, 3:14 PM

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?
  • How will we access such constraints? Always by key and/or full dump, or could other access patterns become useful/interesting in the future?
  • Given that those values will only be updated via the job queue, do we not need active-active write capabilities in the storage, or do you still want to be able to compute the check on demand, making active-active storage recommendable?

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

daniel moved this task from Inbox to Backlog on the TechCom-RFC board. Feb 19 2019, 1:52 PM

Moving to backlog, since we are waiting for feedback from the RFC's author.

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?

Currently there is only the need to store the latest constraint check data for an item.

  • How will we access such constraints? Always by key and/or full dump, or could other access patterns become useful/interesting in the future?

There are currently no other access patterns on the horizon.
(storing the data like this will allow us to load it into the WDQS and query it from there)

  • Given that those values will only be updated via the job queue, do we not need active-active write capabilities in the storage, or do you still want to be able to compute the check on demand, making active-active storage recommendable?

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or because the check data does not already exist or is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request).
But that is not the immediate plan.

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

The main regular updates for the WDQS will come via events in Kafka, with the WDQS then retrieving the constraint check data from an MW API in the same way that it retrieves changes to entities.
The "full dump" is needed when starting a WDQS server from a fresh state.
The full dump could just be a PHP script that iterates through the storage for all entities and slowly dumps the data to disk (similar to our dumpJson or dumpRdf scripts, and to a regular MW dump); a minimal sketch follows.
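
A minimal sketch of such a dump loop; getAllEntityIds() and getStoredCheckResults() are hypothetical accessors over the persistent store, and a real maintenance script would extend Maintenance and batch/throttle its reads.
```php
<?php
// Illustrative dump loop writing one JSON line per entity's stored results.
$out = fopen( 'constraint-checks.ndjson', 'w' );
foreach ( getAllEntityIds() as $entityId ) {           // hypothetical iterator
    $results = getStoredCheckResults( $entityId );      // hypothetical accessor
    if ( $results !== null ) {
        fwrite( $out, json_encode( [ 'entity' => $entityId, 'results' => $results ] ) . "\n" );
    }
    // Sleeping periodically here would keep the load on the store low.
}
fclose( $out );
```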

Addshore moved this task from Backlog to Inbox on the TechCom-RFC board. Feb 20 2019, 9:37 AM

Moving back to Inbox as the feedback is now provided and this is again in the court of TechCom

Krinkle updated the task description. Feb 20 2019, 9:43 PM
Joe added a comment. Feb 21 2019, 6:25 AM

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or because the check data does not already exist or is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request).
But that is not the immediate plan.

My point here is quite subtle but fundamental: if we can split reads and writes to this datastore based on the HTTP verb, so that constraints would be persisted only via either
a) a specific job enqueued (by user request or by
b) a POST request
then it would be possible to store these data in the cheapest k-v storage we have, the ParserCache. That would typically be cheaper and faster than using a distributed k-v storage like Cassandra, which I'd reserve for things that need to be written to from multiple datacenters.

I think we can easily have this only persist to the store via the Job or a POST.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.
Once the storage is fully populated this isn't even a case we need to think about.
Purges of the stored data would then happen via a POST (similar interface to page purges), as in the sketch below.
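
A sketch of that read/write split, reusing the hypothetical helper names from the earlier sketches; only the job and POSTs (purges) would write to the store, while GETs either read stored data or compute on the fly without persisting.
```php
<?php
// Illustrative split of reads and writes by HTTP verb (names are assumptions).
function handleCheckRequest( string $entityId, string $httpMethod ): array {
    $store = getConstraintResultStore(); // hypothetical persistent K/V store

    if ( $httpMethod === 'POST' ) {
        // Explicit purge: recompute and persist fresh results.
        $results = runConstraintChecksFor( $entityId );
        $store->set( $entityId, $results );
        return $results;
    }

    // GET: serve stored results if present; otherwise compute but do not persist.
    return $store->get( $entityId ) ?? runConstraintChecksFor( $entityId );
}
```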

Addshore triaged this task as Normal priority. Feb 21 2019, 8:46 AM

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

Yes, we could still trigger a job on the GET :)
It's probably cleaner to just have it run again; this won't be a high-traffic use case and will slowly vanish, so there's no need to worry about the duplicated effort / wasted cycles.

and once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

and once we have data for all items persistently stored, in theory a user would never ask for an item's constraint check and find it missing (thus no writing to the storage on request)
Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

That is a good point.
We could consider having different API behaviour for GET vs POST, with POST being the only one that returns fresh results.

We could also do a variety of other things, for example:

  • re-run the constraint checks in the web request and then present them to the user
  • send an error to the client saying the results are outdated and to retry later, while running the checks in the job queue
  • serve the outdated results but trigger a job to update the stored results (sketched after this list)?
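
A sketch of the third option above (serve stale data, refresh in the background), again reusing the hypothetical names from the earlier sketches; this is an illustration, not an agreed design.
```php
<?php
// Illustrative stale-while-revalidate handling for a GET.
function serveCheckResults( string $entityId, Title $title ): array {
    $store = getConstraintResultStore();
    $stored = $store->get( $entityId );

    if ( $stored !== null && isStale( $stored ) ) {
        // Hand the outdated data back immediately, but queue a refresh job.
        JobQueueGroup::singleton()->push(
            new CheckConstraintsJob( $title, [ 'entityId' => $entityId ] )
        );
        return $stored;
    }

    return $stored ?? runConstraintChecksFor( $entityId );
}
```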

Perhaps we need to figure out exactly how we want this to appear to the user, @Lydia_Pintscher, so that we can figure out what we want to do in the background.

kchapman moved this task from Inbox to Under discussion on the TechCom-RFC board. Feb 28 2019, 6:32 AM

Just to be clear on this RFC regarding my above comment, we are not waiting for a reply from @Lydia_Pintscher here. The decision is to keep the behaviour for the user the same. This still allows us to only need to write to storage during POSTs and via the job queue.

Krinkle updated the task description. Mar 20 2019, 7:17 PM
abian added a subscriber: abian. Apr 9 2019, 8:43 PM

Just poking this a few months down the line; as far as I know this still rests with TechCom for a decision, if discussion has finished?

daniel moved this task from Under discussion to Inbox on the TechCom-RFC board. Jun 14 2019, 7:35 PM

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

Yup, it seems like it is ready.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

Gotcha

daniel added a comment. Edited Jun 25 2019, 9:06 AM

Side note: we discussed this use case at the CPT offsite. It seems like this would fit a generalized parser cache mechanism. This is something we will have to look into for the integration of Parsoid in MW core anyway, but it's at least half a year out, still.

See T227776: Generalize ParserCache into a generic service class for large "current" page-derived data

Just a quick question: does "this would fit a generalized parser cache mechanism" mean it would fit into the existing ParserCache mechanism (and infrastructure), or is that still to be defined?
Thanks!

Just a quick question: does "this would fit a generalized parser cache mechanism" mean it would fit into the existing ParserCache mechanism (and infrastructure), or is that still to be defined?

It doesn't fit the current mechanism, since we would need an additional key (or key suffix).

The infrastructure for the new generalized cache is not yet defined. It would cover the functionality of the current parser cache (Memcached+SQL) and the Parsoid cache (currently Cassandra). It's not yet clear which of the two the unified mechanism would use, or if it should use something else entirely. So far, the generalized parser cache is just an idea. There is no plan yet.
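
To illustrate the "additional key (or key suffix)" point: a generalized cache would have to distinguish several kinds of page-derived data for the same page. The key layout below is purely an assumption, not a proposed interface.
```php
<?php
// Illustrative key suffixes for different kinds of page-derived data.
use MediaWiki\MediaWikiServices;

$pageId = 12345; // example page ID
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$parserOutputKey    = $cache->makeKey( 'page-derived-data', $pageId, 'parser-output' );
$constraintCheckKey = $cache->makeKey( 'page-derived-data', $pageId, 'constraint-checks' );
```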

daniel moved this task from Inbox to Backlog on the TechCom-RFC board. Jul 11 2019, 2:27 PM
daniel added a subscriber: mobrovac.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created. Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?
  • should data ever be evicted? Does it *have* to be evicted?
  • how bad is it if we lose some data unexpectedly? How bad is it for all the data to become unavailable?
  • what's the read/write load?
  • what are the requirements for cross-DC replication?
  • what transactional consistency requirements exist?
  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Also, do you have a specific storage technology in mind? In discussions about this, Cassandra seems to pop up regularly, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (no abstraction layer, but apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.
Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have to it right now would be the parser cache system backed by MySQL.

Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?

I can't talk in terms of bytes right now, but we can add a bit of tracking to our current cache to try to figure out an average size, and estimate a rough total size from that if that's what we want.
If we are talking about the number of entries, this would roughly line up with the number of Wikidata entities, which is currently 58 million.

  • should data ever be evicted? Does it *have* to be evicted?

It does not *have* to be "evicted", but there will be situations where it is detected to be out of date and thus regenerated.

  • how bad is it if we lose some data unexpectedly?

Not very; everything can and will be regenerated, but that takes time.

How bad is it for all the data to become unavailable?

Unavailable or totally lost?
Unavailable for a short period of time would not be critical.
Unavailable for longer periods of time could have knock-on effects on other services, such as the WDQS not being able to update fully once T201147 is complete, but I'm sure whatever update code is created would be able to handle such a situation.

Totally losing all of the data would be pretty bad; it would probably take an extremely long time to regenerate it at a regular pace for all entities.

  • what's the read/write load?

Write load, once the job is fully deployed, would be roughly the Wikidata edit rate, but limited/controlled by the job queue rate for "constraintsRunCheck".
This can be guesstimated at 250-750 per minute max, but there will also be de-duplication for edits to the same pages to account for.
If more exact numbers are required we can have a go at figuring that out.
Currently the job is only running on 25% of edits.

The read rate can currently be seen at https://grafana.wikimedia.org/d/000000344/wikidata-quality?panelId=6&fullscreen&orgId=1
On top of this, the WDQS updaters would also need this data once it is generated.
This would be either via an HTTP API request, which would likely hit the storage, or possibly via some event queue?

  • what are the requirements for cross-DC replication?

Having the data accessible from both DCs (for the DC failover case) should be a requirement.

  • what transactional consistency requirements exist?

No super-important requirements here.
If we write to the store, we would love for it to be written and readable within the next second.
Writes for a single key will not really happen too close together; there will probably be multiple seconds between them.
Interaction between keys and the order in which writes are committed to the store isn't really important.

  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Just a plain K/V store.

Also, do you have a specific storage technology in mind? In discussions about this, Cassandra seems to pop up regularly, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (no abstraction layer, but apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

For the technology we don't have any particular preferences; whatever works for the WMF, Ops, and TechCom.
Ideally it would be something that we can get access to and start working with sooner rather than later.

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

More than happy to try and hash this out a bit more in this ticket before passing it back to a TechCom meeting again.
It'd be great to try and make some progress here in the coming month.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have to it right now would be the parser cache system backed by MySQL.

I'd be interested in your thoughts on T227776: Generalize ParserCache into a generic service class for large "current" page-derived data.

daniel moved this task from Backlog to Inbox on the TechCom-RFC board. Wed, Jul 31, 5:27 AM
daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. Wed, Aug 7, 9:00 PM
daniel added a subscriber: Krinkle.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

Addshore added a comment. Edited Fri, Aug 9, 6:36 PM

Without digging into it too deeply, that sounds like exactly what we need.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

I'm keen to answer more questions and hear ideas!