
RFC: Store WikibaseQualityConstraint check data in persistent storage
Open, Medium, Public

Description

  • Affected components: WikibaseQualityConstraint (MediaWiki extension for wikidata.org), Wikidata Query Service.
  • Engineer for initial implementation: Wikidata team (WMDE).
  • Code steward: Wikidata team (WMDE).

Motivation

This RFC is a result of T204024: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache. Specifically, the request to create an RFC in T204024#4891344.

Vocabulary:

  • WikibaseQualityConstraints (WBQC, or MediaWiki), a MediaWiki extension deployed on www.wikidata.org.
  • Wikidata Query Service (WDQS, or Query service), the service at https://query.wikidata.org.

Current situation:

The Query service performs checks on Wikidata entities on-demand from users. Results of these constraint checks are cached in MediaWiki (WBQC) using Memcached with a default TTL of 1 day (86400 seconds).

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

The special page and API can be used by users directly. The API is called by client-side JavaScript whenever a logged-in user visits an entity page on www.wikidata.org; the JS then displays the results on the entity page.

The RDF page-action exists for use by the WDQS and will not run the constraint checks itself; it only exposes an RDF description of the currently stored constraints that apply to this entity.

The special page currently always re-runs the constraint checks via WDQS; it does not get or set any cache.

The API only makes an internal request to WDQS if the constraint check data is out of date, absent, or expired for the current entity. When the API retrieves data from the cache, the WBQC extension has built-in logic to determine whether the stored result needs to be updated (e.g. because something in the dependency graph has changed).
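To make the current caching behaviour concrete, the sketch below shows the lookup pattern described above using MediaWiki's WANObjectCache. It is not the actual WBQC code; checkConstraints() and resultIsStale() are hypothetical stand-ins for the extension's own check and staleness logic.

```php
// Minimal sketch of the current behaviour (not the actual WBQC code):
// results are cached per entity with a one-day TTL and regenerated when
// absent, expired, or detected to be stale.
use MediaWiki\MediaWikiServices;

function getConstraintCheckResults( string $entityId ): array {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	$key = $cache->makeKey( 'wbqc-constraint-check', $entityId );

	$cached = $cache->get( $key );
	if ( $cached !== false && !resultIsStale( $cached ) ) {
		// Nothing in the dependency graph has changed; serve the cached result.
		return $cached;
	}

	// Out of date, absent, or expired: re-run the checks (some go through
	// WDQS, others are done purely in PHP) and cache the result for a day.
	$results = checkConstraints( $entityId );
	$cache->set( $key, $results, 86400 );
	return $results;
}
```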

We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user (T204031). This way, the results are more likely to be in the cache when requested shortly afterwards by the Query service. We could make the Job emit some kind of event that informs the Query service to pull the new data from the API and ingest it (T201147).
Loading and re-loading of data into the WDQS will also create the need to dump all constraint checks.
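As a rough illustration of the job-based approach (not the actual T204031 implementation), enqueueing such a post-edit job could look like the sketch below; the 'constraintsRunCheck' job name is the one mentioned later in this task, and the parameter shape is an assumption.

```php
// Illustrative sketch only: enqueue a post-edit constraint check job for the
// edited entity. The parameter names are assumptions, not the real job spec.
function queuePostEditConstraintCheck( Title $title ): void {
	$job = new JobSpecification(
		'constraintsRunCheck',
		[ 'entityId' => $title->getText() ],
		// De-duplicate so repeated edits to the same entity don't pile up jobs.
		[ 'removeDuplicates' => true ],
		$title
	);
	JobQueueGroup::singleton()->lazyPush( $job );
}
```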

5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check to be executed. Of the roughly 54 million items, 1.85 million items do not have any statements, leaving 52 million items that do have statements and need to have constraint checks run. Constraint checks also run on Properties and Lexemes but the number there is negligible when compared with Items.

Constraint checks on an item can take widely varying amounts of time to execute, depending on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING) and the performance of all constraint checks is monitored on Grafana.

Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.

Problem statement:

Primary problem statement:

  • Constraint check results need to be loaded into WDQS, but we don't currently have the results of all constraint checks for all Wikidata items stored anywhere.

Secondary problem statements:

  • Generating constraint reports when the user requests them leads to a bad user experience as they must wait for a prolonged amount of time.
  • Users can flood the API by generating constraint checks for entities, putting unnecessary load on the app servers.
Requirements

(Specify the requirements that a proposal should meet.)


Exploration

Proposal
  • Rather than defaulting to running constraint checks on a user's request, primarily pre-generate constraint check results post-edit using the job queue (T204031).
  • Rather than storing constraint check results in Memcached, store them in a more permanent storage solution.
  • When new constraint check results are stored, fire an event for the WDQS to listen to so that it can load the new constraint check data (an illustrative payload is sketched after this list).
  • Allow constraint check data to be dumped from the persistent storage to file and loaded into WDQS.
  • Use the same logic that currently exists to determine whether the stored constraint check data needs updating when it is retrieved.
  • Alterations to the special page to load from the cache? Provide the timestamp of when the checks were run? Provide a way to manually purge the checks and re-run them (get the latest results) with a button on the page.
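For the event mentioned in the third bullet, the payload would likely carry little more than the entity and revision it applies to. The array below is purely illustrative; the actual schema (and whether it travels via EventBus/Kafka) is not decided by this RFC.

```php
// Purely illustrative payload for a "new constraint check data stored" event.
// Field names are assumptions; the real schema is out of scope for this RFC.
$event = [
	'entity_id'   => 'Q64',                      // entity whose check results were stored
	'revision_id' => 1234567890,                 // revision the checks were run against
	'dt'          => wfTimestamp( TS_ISO_8601 ), // when the results were stored
	// On receipt, WDQS would fetch the fresh data (e.g. via the RDF action or API).
];
```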

Note: Even when constraint checks are run after all entity edits, the persistently stored data will slowly become out of date (and therefore also the data stored by WDQS). The case of one edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.

Event Timeline

daniel added a subscriber: daniel. · Edited · Jan 23 2019, 9:29 PM

This is the same kind of storage need that is currently causing problems here: T210548: gzip-encoded page properties can't be exported from the API. There are a lot of parallels to T119043: Graph/Graphoid/Kartographer - data storage architecture.

Addshore moved this task from incoming to blocked on others on the Wikidata board. · Feb 7 2019, 3:55 PM
Joe added a subscriber: Joe. · Feb 14 2019, 3:14 PM

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?
  • How will we access such constraints? Always by key and/or full dump, or could other access patterns become useful/interesting in the future?
  • Given that those values will only be updated via the job queue, do we not need active-active write capabilities in the storage, or do you still want to be able to compute the check on demand, in which case a/a storage would be recommendable?

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

daniel moved this task from P1: Define to Old on the TechCom-RFC board. · Feb 19 2019, 1:52 PM

Moving to backlog, since we are waiting for feedback from the RFC's author.

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?

Currently there is only the need to store the latest constraint check data for an item.

  • How will we access such constraints? Always by key and/or full dump, or could other access patterns become useful/interesting in the future?

There are currently no other access patterns on the horizon.
(storing the data like this will allow us to load it into the WDQS and query it from there)

  • Given that those values will only be updated via the job queue, do we not need active-active write capabilities in the storage, or do you still want to be able to compute the check on demand, in which case a/a storage would be recommendable?

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or because the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and have it not be there (thus no writing to the storage on request).
But that is not the immediate plan.

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

The main regular updates for the WDQS will come via events in Kafka, with the WDQS then retrieving the constraint check data from an MW API in the same way that it retrieves changes to entities.
The "full dump" is needed when starting a WDQS server from a fresh start.
The full dump could just be a PHP script that iterates through the storage for all entities and slowly dumps the data to disk (similar to our dumpJson or dumpRdf scripts, and similar to a regular MW dump).
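A rough sketch of what such a dump script might look like, modelled on MediaWiki's Maintenance class in the spirit of dumpJson/dumpRdf. The two private methods are placeholders, since the storage backend is exactly what this RFC is about.

```php
<?php
// Hypothetical dump script: iterate over all entities and write their stored
// constraint check data to stdout, one JSON line per entity.
require_once __DIR__ . '/Maintenance.php'; // path depends on where the script lives

class DumpConstraintCheckData extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Dump stored constraint check data for all entities' );
	}

	public function execute() {
		foreach ( $this->getAllEntityIds() as $entityId ) {
			$checks = $this->getStoredCheckResult( $entityId );
			if ( $checks !== null ) {
				$this->output( json_encode( [ 'entity' => $entityId, 'checks' => $checks ] ) . "\n" );
			}
		}
	}

	private function getAllEntityIds(): iterable {
		return []; // placeholder: would iterate the chosen storage / entity ID source
	}

	private function getStoredCheckResult( string $entityId ): ?array {
		return null; // placeholder: would read the stored constraint check blob
	}
}

$maintClass = DumpConstraintCheckData::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```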

Addshore moved this task from Old to P1: Define on the TechCom-RFC board. · Feb 20 2019, 9:37 AM

Moving back to Inbox as the feedback is now provided and this is again in the court of TechCom

Krinkle updated the task description. · Feb 20 2019, 9:43 PM
Joe added a comment. · Feb 21 2019, 6:25 AM

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or because the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and have it not be there (thus no writing to the storage on request).
But that is not the immediate plan.

My point here is quite subtle but fundamental: if we can split reads and writes to this datastore based on the HTTP verb, so that constraints would be persisted only via either
a) a specific job being enqueued (by user request or by …)
b) a POST request
then it would be possible to store these data in the cheapest k-v storage we have, the ParserCache. That would typically be cheaper and faster than using a distributed k-v storage like Cassandra, which I'd reserve for things that need to be written to from multiple datacenters.

I think we can easily have this only persist to the store via the Job or a POST.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.
Once the storage is fully populated this isn't even a case we need to think about.
Purges of the stored data would then happen via a POST (similar interface to page purges).
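A sketch of that read/write split, assuming a hypothetical key/value store object ($store) and a hypothetical runConstraintChecks() helper: GETs may compute results on the fly but never persist them, so only the job (or a POST/purge) ever writes.

```php
// Sketch of the read/write split: plain GET requests never write to the store.
function handleCheckRequest( WebRequest $request, string $entityId, $store ): array {
	$stored = $store->get( $entityId );
	if ( $stored !== null ) {
		return $stored;
	}

	// Missing: compute on the fly, but only persist outside of plain GETs.
	$fresh = runConstraintChecks( $entityId );
	if ( $request->wasPosted() ) {
		$store->set( $entityId, $fresh );
	}
	return $fresh;
}
```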

Addshore triaged this task as Medium priority. · Feb 21 2019, 8:46 AM

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

You could trigger a job in that case. The job may even contain the generated data, though that may get too big in some cases.

Yes, we could still trigger a job on the GET :)
Probably cleaner to just have it run again; this won't be a high-traffic use case and will slowly vanish, so there is no need to worry about the duplicated effort / wasted cycles.

and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and have it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and have it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

That is a good point.
We could consider having different API behaviour for GET vs POST, with POST being the only one that would get fresh results.

We could also do a variety of other things, for example:

  • re-run the constraint checks in the web request and then present them to the user
  • send an error to the client saying the results are outdated and to retry later, while running the checks in the job queue
  • serve the outdated results but trigger a job to update the stored results? (see the sketch below)
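A sketch of that third option, reusing the hypothetical store and helpers from the earlier sketches: serve whatever is stored, and if it is stale, queue a refresh so the store (and, downstream, WDQS) catches up.

```php
// Sketch of "serve the outdated results but trigger a job": the stale result
// is returned immediately while a 'constraintsRunCheck' job refreshes the
// store in the background. $store and resultIsStale() are hypothetical.
function serveWithBackgroundRefresh( string $entityId, $store ): array {
	$stored = $store->get( $entityId );
	if ( $stored !== null && resultIsStale( $stored ) ) {
		JobQueueGroup::singleton()->lazyPush(
			new JobSpecification( 'constraintsRunCheck', [ 'entityId' => $entityId ] )
		);
	}
	return $stored ?? [];
}
```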

Perhaps we need to figure out exactly how we want this to appear to the user @Lydia_Pintscher so that we can figure out what we want to do in the background.

Just to be clear on this RFC regarding my above comment, we are not waiting for a reply from @Lydia_Pintscher here. The decision is to keep the behaviour for the user the same. This still allows us to only need to write to storage during POSTs and via the job queue.

Krinkle updated the task description. · Mar 20 2019, 7:17 PM
abian added a subscriber: abian. · Apr 9 2019, 8:43 PM

Just poking this a few months down the line, as far as I know this still rests with TechCom for a decision if discussion has finished?

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

Yup, it seems like it is ready.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

Gotcha

daniel added a comment. · Edited · Jun 25 2019, 9:06 AM

Side note: we discussed this use case at the CPT offsite. It seems like this would fit a generalized parser cache mechanism. This is something we will have to look into for the integration of Parsoid in MW core anyway, but it's at least half a year out, still.

See T227776: Generalize ParserCache into a generic service class for large "current" page-derived data

Just a quick question: "this would fit a generalized parser cache mechanism" meaning it would fit into the existing parsercache mechanism (and infrastructure) or is that still to be defined?
Thanks!

Just a quick question: "this would fit a generalized parser cache mechanism" meaning it would fit into the existing parsercache mechanism (and infrastructure) or is that still to be defined?

It doesn't fit the current mechanism, since we would need an additional key (or key suffix).

The infrastructure for the new generalized cache is not yet defined. It would cover the functionality of the current parser cache (Memcached+SQL) and the Parsoid cache (currently Cassandra). It's not yet clear which of the two the unified mechanism would use, or if it should use something else entirely. So far, the generalized parser cache is just an idea. There is no plan yet.

daniel moved this task from P1: Define to Old on the TechCom-RFC board. · Jul 11 2019, 2:27 PM
daniel added a subscriber: mobrovac.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created. Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?
  • should data ever be evicted? Does it *have* to be evicted?
  • how bad is it if we lose some data unexpectedly? How bad is it for all the data to become unavailable?
  • what's the read/write load?
  • what are the requirements for cross-DC replication?
  • what transactional consistency requirements exist?
  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Also, do you have a specific storage technology in mind? In discussions about this, Cassandra seems to regularly pop up, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (no abstraction layer, but apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have like it right now would be the parser cache system backed by MySQL.

Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?

I can't talk in terms of bytes right now, but we can add a bit of tracking to our current cache to try to figure out an average size, and estimate a rough total size from that if that's what we want.
If we are talking about number of entries, this would roughly line up with the number of Wikidata entities, which is right now 58 million.

  • should data ever be evicted? Does it *have* to be evicted?

It does not *have* to be "evicted", but there will be situations where it is detected to be out of date and thus regenerated.

  • how bad is it if we lose some data unexpectedly?

Not very; everything can and will be regenerated, but that takes time.

How bad is it for all the data to become unavailable?

Unavailable or totally lost?
Unavailable for a short period of time would not be critical.
Unavailable for longer periods of time could have knock-on effects on other services, such as WDQS not being able to update fully once T201147 is complete, but I'm sure whatever update code is created would be able to handle such a situation.

Totally losing all of the data would be pretty bad; it would probably take an extreme amount of time to regenerate it at a regular pace for all entities.

  • what's the read/write load?

Write load once the job is fully deployed would be roughly the Wikidata edit rate, but limited / controlled by the job queue rate for "constraintsRunCheck".
This can be guesstimated at 250-750 per minute max, but there will also be de-duplication of edits to the same pages to take into account.
If more exact numbers are required we can have a go at figuring that out.
Currently the job is only running on 25% of edits.

Read rate can currently be seen at https://grafana.wikimedia.org/d/000000344/wikidata-quality?panelId=6&fullscreen&orgId=1
On top of this, the WDQS updaters would also need this data once generated.
This would either be via an HTTP API request, which would likely hit the storage, or it could possibly be sent in some event queue?

  • what are the requirements for cross-DC replication?

Having the data accessible from both DCs (for the DC failover case) should be a requirement.

  • what transactional consistency requirements exist?

No super important requirements here.
If we write to the store we would love for it to be written and readable in the next second.
Writes for a single key will not really happen too close together, probably multiple seconds between them.
Interaction between keys and order of writes being committed to the store isn't really important.

  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Just a plain K/V store.

Also, do you have a specific storage technology in mind? In discussions about this, Cassandra seems to regularly pop up, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (no abstraction layer, but apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

For technology we don't have any particular preferences; whatever works for the WMF, Ops and TechCom.
Ideally something that we would be able to get access to and start working with sooner rather than later.

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

More than happy to try and hash this out a bit more in this ticket before passing it back to a TechCom meeting again.
It'd be great to try and make some progress here in the coming month.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have like it right now would be the parser cache system backed by MySQL.

I'd be interested in your thoughts on T227776: Generalize ParserCache into a generic service class for large "current" page-derived data.

daniel moved this task from Old to P1: Define on the TechCom-RFC board. · Jul 31 2019, 5:27 AM
daniel added a subscriber: Krinkle.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

Addshore added a comment. · Edited · Aug 9 2019, 6:36 PM

Without digging into it too deeply that sounds like exactly what we need.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

I'm keen to answer more questions and hear ideas!

I see that T227776 has had little activity since November, is there any way to move this forward?

I see that T227776 has had little activity since November, is there any way to move this forward?

It's stalled on "nobody needs this", so saying "I need this" may help move it forward :) Talk to @kchapman and @CCicalese_WMF for prioritization.

We need this! :D (Seriously!)

Yes, we need this!
This decision is one of the two RFCs blocking continued work on constraint checks for Wikidata that was started in either late 2018 or early 2019. :)

Yes, we need this!
This decision is one of the two RFCs blocking continued work on constraint checks for Wikidata that was started in either late 2018 or early 2019. :)

Tagging Core Platform Team for consideration of a future initiative.

CCicalese_WMF added a subscriber: eprodromou.

@eprodromou and I agree that this is valuable work for a future initiative.

Krinkle updated the task description. · Apr 3 2020, 10:29 PM
Krinkle updated the task description. · Apr 3 2020, 11:05 PM
Task description

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

[…] The RDF page-action exists for use by the WDQS and will not run the constraint check itself, it only exposes an RDF description of the currently stored constraints that apply to this entity.

If I understand correctly then, the RDF action is not a way to access the result of, nor to trigger, a WDQS request. Is that right?

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Task description

The Query service performs checks on Wikidata entities on-demand from users. Results of these constraint checks are cached by MediaWiki (WBQC) in Memcached. […] The API only makes an internal request to WDQS if the constraint checks data is out of date, absent, or expired […].
[…] We could make the Job […] that informs the Query service to pull the API to ingest the new data.

  • […] we don't currently have the result of all constraints checks for all Wikidata items stored anywhere.
From T201147:

At the moment constraints violations are only imported to WDQS if they are cached the moment WDQS pulls the rdfs for constraint violations for an item. There is a race condition between the WDQS poller and the constraints check execution and this is why only a fraction of constraint violations are imported.

The above sounds contradictory to me, but I assume that must be because I'm misunderstanding something.

If I understand correctly, the authoritative source for describing items is Wikidata.org. The RDF Action on Wikidata.org exposes information relevant to constraint checks. The way we actually execute those constraint checks is by submitting a query to the Query service (WDQS), which has a nice relational model of all the relationships and metadata etc. The thing that executes these checks is the MediaWiki WikibaseQualityConstraints extension (WBQC), and it caches the result for a day in Memcached.

So far so good, I think. But then I also read that Query service (WDQS) ingests the result of these checks (which it executed itself?), and that we want the Job to notify WDQS when it is best to poll for that so that it is likely a Memcached cache-hit.

I don't know why the result of this is stored in WDQS. But, that sounds to me like you already have a place to store them all?

Krinkle updated the task description.
Task description

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

[…] The RDF page-action exists for use by the WDQS and will not run the constraint check itself, it only exposes an RDF description of the currently stored constraints that apply to this entity.

If I understand correctly then, the RDF action is not a way to access the result of, nor to trigger, a WDQS request. Is that right?

No

Part of the thing being stored may have been generated using data from a query to WDQS however.

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Sorry, it does set the cache, but will not get the cache.

Task description

The Query service performs checks on Wikidata entities on-demand from users. Results of these constraint checks are cached by MediaWiki (WBQC) in Memcached. […] The API only makes an internal request to WDQS if the constraint checks data is out of date, absent, or expired […].
[…] We could make the Job […] that informs the Query service to pull the API to ingest the new data.

  • […] we don't currently have the result of all constraints checks for all Wikidata items stored anywhere.
From T201147:

At the moment constraints violations are only imported to WDQS if they are cached the moment WDQS pulls the rdfs for constraint violations for an item. There is a race condition between the WDQS poller and the constraints check execution and this is why only a fraction of constraint violations are imported.

The above sounds contradictory to me, but I assume that must be because I'm misunderstanding something.

If I understand correctly, the authoritative source for describing items is Wikidata.org.

Yes

The RDF Action on Wikidata.org exposes information relevant to constraint checks.

Yes
For example https://www.wikidata.org/wiki/Q64?action=constraintsrdf
If this page appears blank then you'll need to hit https://www.wikidata.org/wiki/Special:ConstraintReport/Q64 first to generate and cache the results

The way we actually execute those constraint checks is by submitting a query to the Query service (WDQS), which has a nice relational model of all the relationships and metadata etc.

Only some checks result in queries to the WDQS.
Other checks are done purely in PHP.

The thing that executes these checks is the MediaWiki WikibaseQualityConstraints extension (WBQC), and it caches the result for a day in Memcached.

Yes

So far so good, I think. But then I also read that Query service (WDQS) ingests the result of these checks (which it executed itself?), and that we want the Job to notify WDQS when it is best to poll for that so that it is likely a Memcached cache-hit.

I don't know why the result of this is stored in WDQS. But, that sounds to me like you already have a place to store them all?

WDQS is not really a store; for us it is currently 16 or so stores.
On top of that, it isn't a store, it is a query service, and it has all kinds of baggage attached to it because of that.

Anyway, these results need to be accessible in MediaWiki PHP code.
WDQS instances should also not be seen as persistent or consistent.
Part of the requirement of this task is that we have a copy of constraints for entities so that we can dump them and reload a WDQS instance.

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Sorry, it does set the cache, but will not get the cache.

I don’t think it sets it either – as far as I’m aware the special page is completely oblivious to the cache. As to the “why”, it seemed useful at the time to have a way for users to get guaranteed-fresh constraint check results. (We could still set the cache in that case, of course, it’s just not implemented – and since the special page is very rarely used, it wouldn’t make much of a difference.)