
RFC: Store WikibaseQualityConstraint check data in persistent storage
Open, Medium, Public

Description

  • Affected components: WikibaseQualityConstraint (MediaWiki extension for wikidata.org), Wikidata Query Service.
  • Engineer for initial implementation: Wikidata team (WMDE).
  • Code steward: Wikidata team (WMDE).

Motivation

This RFC is a result of T204024: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache. Specifically, the request to create an RFC in T204024#4891344.

Vocabulary:

  • WikibaseQualityConstraints (WBQC, or MediaWiki), a MediaWiki extension deployed on www.wikidata.org.
  • Wikidata Query Service (WDQS, or Query service), the service at https://query.wikidata.org.

Current situation:

The Query service performs checks on Wikidata entities on demand from users. Results of these constraint checks are cached in MediaWiki (WBQC) using Memcached with a default TTL of 1 day (86400 seconds).

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

The special page and API can be used by users directly. The API is called by client-side JavaScript whenever a logged-in user visits an entity page on www.wikidata.org; the JS then displays the results on the entity page.

The RDF page-action exists for use by the WDQS and does not run the constraint check itself; it only exposes an RDF description of the currently stored constraints that apply to this entity.

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

The API only makes an internal request to WDQS if the constraint check data is out of date, absent, or expired for the current entity. When the API retrieves data from the cache, the WBQC extension has logic built in to determine whether the stored result needs to be updated (e.g. because something in the dependency graph has changed).
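As a rough illustration only (all names below are invented for this sketch, not the extension's real API; written in Python rather than PHP for brevity), the staleness logic amounts to tracking which revisions the check depended on, plus an optional future expiry time:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CachedCheckResult:
    """Hypothetical shape of a cached constraint-check result."""
    results: dict                         # the serialized check results
    dependencies: dict                    # entity ID -> revision ID at check time
    future_time: Optional[float] = None   # earliest time the result self-expires

def is_stale(cached: CachedCheckResult,
             latest_revision: Callable[[str], int],
             now: float) -> bool:
    """True if the stored result must be regenerated."""
    # Stale if anything in the dependency graph has been edited since.
    for entity_id, rev_at_check in cached.dependencies.items():
        if latest_revision(entity_id) != rev_at_check:
            return True
    # Stale if a time-based constraint has crossed its threshold.
    return cached.future_time is not None and now >= cached.future_time
```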

We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user (T204031). This way, they are more likely to be in cache when requested shortly afterwards by the Query service. We could make the Job emit some kind of event that informs the Query service to pull from the API and ingest the new data (T201147).
Loading and re-loading of data into the WDQS will also present the need to dump all constraint checks.

5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check to be executed. Of the roughly 54 million items, 1.85 million items do not have any statements, leaving 52 million items that do have statements and need to have constraint checks run. Constraint checks also run on Properties and Lexemes but the number there is negligible when compared with Items.

Constraint checks on an item can take a wide variety of times to execute based on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING) and the performance of all constraint checks is monitored on Grafana.

Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.

Problem statement:

Primary problem statement:

  • Constraint check results need to be loaded into WDQS, but we don't currently have the result of all constraint checks for all Wikidata items stored anywhere.

Secondary problem statements:

  • Generating constraint reports when the user requests them leads to a bad user experience as they must wait for a prolonged amount of time.
  • Users can flood the API generating constraint checks for entities, putting unnecessary load on the app servers.

Requirements
  • Data can be persistently stored for every Wikidata entity (after every edit).
  • Only current state (not historical state) needs to be stored.
  • Data can be stored from MediaWiki / Wikibase.
  • Data can be retrieved from storage from MediaWiki / Wikibase.
  • Storage can be dumped (probably via a MediaWiki maintenance script) into a file or set of files (for WDQS loading).

Exploration

Proposal
  • Rather than defaulting to running constraint checks upon a user's request, primarily pre-generate constraint check results post-edit using the job queue (T204031).
  • Rather than storing constraint check results in Memcached, store them in a more permanent storage solution.
  • When new constraint check results are stored, fire an event for the WDQS to listen to so that it can load the new constraint check data.
  • Dump constraint check data from the persistent storage to allow for dumping to file and loading into WDQS.
  • Use the same logic that currently exists to determine if the stored constraint check data needs updating when retrieved.
  • Alterations to the special page to load from the cache? Provide the timestamp of when the checks were run? Provide a way to manually purge the checks and re-run (get the latest results) with a button on the page.
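A minimal sketch of the proposed post-edit flow, with invented store/event-bus/hook names throughout (only the job name "constraintsRunCheck" comes from the task itself):

```python
def on_entity_edit(job_queue, entity_id):
    """Post-edit hook: enqueue a (de-duplicatable) constraint-check job."""
    job_queue.push("constraintsRunCheck", {"entityId": entity_id})

def run_constraint_check_job(store, event_bus, run_checks, entity_id):
    """The job body: run the checks, persist the result, notify WDQS."""
    result = run_checks(entity_id)               # may itself query WDQS
    store.set(f"wbqc:{entity_id}", result)       # persistent storage, not Memcached
    event_bus.emit("constraint-checks-updated",  # WDQS listens and pulls the data
                   {"entityId": entity_id})
```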

Note: Even when constraint checks are run after all entity edits, the data persistently stored will slowly become out of date (therefore also the data stored by WDQS). The issue of 1 edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.

Related Objects

Event Timeline


In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?
  • How will we access such constraints? Always by key and/or full dump, or other access patterns can become useful/interesting in the future?
  • Given those values will only be updated via the jobqueue, we don't need active-active write capabilities in the storage, or you still want to be able to compute the check on-demand and thus a/a storage is recommendable?

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

Moving to backlog, since we are waiting for feedback from the RFC's author.

In order to better understand your needs, let me ask you a few questions:

  • Do we need/want just the constraint check for the latest version of the item, or one for each revision?

Currently there is only the need to store the latest constraint check data for an item.

  • How will we access such constraints? Always by key and/or full dump, or other access patterns can become useful/interesting in the future?

There are currently no other access patterns on the horizon.
(storing the data like this will allow us to load it into the WDQS and query it from there)

  • Given those values will only be updated via the jobqueue, we don't need active-active write capabilities in the storage, or you still want to be able to compute the check on-demand and thus a/a storage is recommendable?

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or if the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and it not be there (thus no writing to the storage on request).
But that is not the immediate plan.

I'm not sure how the "full dump" would need to work, but it would seem natural that such data would be fed to wdqs with the same mechanism that updates the items.

The main regular updates for the WDQS will come via events in kafka, and then the WDQS retrieving the constraint check data from an MW API in the same way that it retrieves changes to entities.
The "full dump" is needed when starting a WDQS server off from a fresh start.
The full dump could just be a PHP script that iterates through the storage for all entities and slowly dumps the data to disk (similar to our dumpJson or dumpRdf scripts, and similar to a regular MW dump)
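Such a script could be little more than the following (a sketch with invented helper names, in Python rather than PHP for brevity; the real thing would be a MediaWiki maintenance script in the spirit of dumpJson.php):

```python
import json

def dump_constraint_checks(store, entity_ids, out):
    """Iterate the stored constraint-check data for all entities and
    stream it to a file as one JSON document per line."""
    written = 0
    for entity_id in entity_ids:
        data = store.get(f"wbqc:{entity_id}")
        if data is None:
            continue  # nothing persisted yet for this entity
        out.write(json.dumps({"entity": entity_id, "checks": data}) + "\n")
        written += 1
    return written
```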

Moving back to Inbox as the feedback is now provided and this is again in the court of TechCom

We currently still want to be able to compute the check on demand, either because the user wants to purge the current constraint check data, or if the check data does not already exist / is outdated.
It could be possible that later down the line we put the purging of this data into the job queue too, and once we have data for all items persistently stored, in theory the user would never ask for an item's constraint check and it not be there (thus no writing to the storage on request).
But that is not the immediate plan.

My point here is quite subtle but fundamental - if we can split reads and writes to this datastore based on the HTTP verb, so that constraints would be persisted only via either
a - a specific job enqueued (by user request or by
b - a POST request
it would be possible to store these data in the cheapest k-v storage we have, the ParserCache. That would typically be cheaper and faster than using a distributed k-v storage like Cassandra, which I'd reserve for things that need to be written to from multiple datacenters.

I think we can easily have this only persist to the store via the Job or a POST.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.
Once the storage is fully populated this isn't even a case we need to think about.
Purges of the stored data would then happen via a post (similar interface to page purges).
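A sketch of that read/write split (invented names; a GET never writes to the store, and a POST, like the job, is a persisting path):

```python
def handle_check_constraints(method, entity_id, store, run_checks):
    """GET: read from the store, or compute on the fly without persisting.
    POST (purge/refresh): recompute and persist, returning fresh results."""
    key = f"wbqc:{entity_id}"
    if method == "GET":
        cached = store.get(key)
        if cached is not None:
            return cached
        return run_checks(entity_id)   # computed on the fly, deliberately not stored
    if method == "POST":
        fresh = run_checks(entity_id)  # guaranteed-fresh results
        store.set(key, fresh)
        return fresh
    raise ValueError(f"unsupported method: {method}")
```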

Addshore triaged this task as Medium priority. Feb 21 2019, 8:46 AM

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

you could trigger a job in that case. the job may even contain the generated data, though that may get too big in some cases.

We can continue to generate constraint check data on the fly when missing in GETs and simply not put it in the storage.

you could trigger a job in that case. the job may even contain the generated data, though that may get too big in some cases.

Yes we could still trigger a job on the GET :)
Probably cleaner to just have it run again; this won't be a high-traffic use case and will slowly vanish, so no need to worry about the duplicated effort / wasted cycles.

and once we have data for all items persistently stored in theory the user would never ask for an items constraint check and it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

and once we have data for all items persistently stored in theory the user would never ask for an items constraint check and it not be there (thus no writing to the storage on request)

Once the storage is fully populated this isn't even a case we need to think about.

I don’t think these are true – I think it will still be possible that we realize after retrieving the cached result that it is stale, because some referenced page has been edited in the meantime, or because a certain point in time has passed.

That is a good point.
We could consider having different API behaviour for the GET vs POST, POST being the only one that would get fresh results.

We could also do a variety of other things, for example:

  • re-run the constraint checks in the web request and then present them to the user
  • send an error to the client saying they are outdated and to retry later, while running the checks in the job queue
  • serve the outdated results but trigger a job to update the stored results
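The third option is essentially stale-while-revalidate; a hedged sketch with invented names (the staleness predicate and job name stand in for whatever the extension actually uses):

```python
def get_with_background_refresh(entity_id, store, is_stale, enqueue_job):
    """Serve whatever is stored; if it is stale, refresh it via the
    job queue instead of making the user wait."""
    cached = store.get(f"wbqc:{entity_id}")
    if cached is not None and is_stale(cached):
        enqueue_job("constraintsRunCheck", {"entityId": entity_id})
    return cached  # possibly stale; the job will soon replace it
```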

Perhaps we need to figure out exactly how we want this to appear to the user @Lydia_Pintscher so that we can figure out what we want to do in the background.

Just to be clear on this RFC regarding my above comment, we are not waiting for a reply from @Lydia_Pintscher here. The decision is to keep the behaviour for the user the same. This still allows us to only need to write to storage during POSTs and via the job queue.

Just poking this a few months down the line, as far as I know this still rests with TechCom for a decision if discussion has finished?

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

@Addshore if you think no further discussion is needed and this is ripe for a decision, I'll propose to move this to last call at our next meeting. I'll put it in the RFC inbox for that purpose.

Yup, it seems like it is ready.

In general, if you want TechCom to make a call on the status of an RFC or propose for it to move to a different stage in the process, drop it in the inbox, with a comment.

Gotcha

Side note: we discussed this use case at the CPT offsite. It seems like this would fit a generalized parser cache mechanism. This is something we will have to look into for the integration of Parsoid in MW core anyway, but it's at least half a year out, still.

See T227776: General ParserCache service class for large "current" page-derived data

Just a quick question: "this would fit a generalized parser cache mechanism" meaning it would fit into the existing parsercache mechanism (and infrastructure) or is that still to be defined?
Thanks!

Just a quick question: "this would fit a generalized parser cache mechanism" meaning it would fit into the existing parsercache mechanism (and infrastructure) or is that still to be defined?

It doesn't fit the current mechanism, since we would need an additional key (or key suffix).

The infrastructure for the new generalized cache is not yet defined. It would cover the functionality of the current parser cache (Memcached+SQL) and the Parsoid cache (currently Cassandra). It's not yet clear which of the two the unified mechanism would use, or if it should use something else entirely. So far, the generalized parser cache is just an idea. There is no plan yet.

daniel added a subscriber: mobrovac.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created. Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?
  • should data ever be evicted? Does it *have* to be evicted?
  • how bad is it if we lose some data unexpectedly? How bad is it for all the data to become unavailable?
  • what's the read/write load?
  • what are the requirements for cross-DC replication?
  • what transactional consistency requirements exist?
  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Also, do you have a specific storage technology in mind? In discussions about this, Cassandra seems to regularly pop up, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (no abstraction layer, and apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

Moved to the RFC backlog for improvement after discussion at the TechCom meeting. The proposed functionality seems sensible enough, but this ticket is lacking information about system design that is needed to make this viable as an RFC.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have like it right now would be the parser cache system backed by MySQL.

Which raises a number of questions, like:

  • what volume of data do you expect that store to hold?

I can't talk in terms of bytes right now, but we can add a bit of tracking to our current cache to try and figure out an average size, and from that a rough total size, if that's what we want.
If we are talking about number of entries, this would roughly line up with the number of Wikidata entities, which is right now 58 million.
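For a back-of-the-envelope total size, under a purely assumed average result size (the real average is unmeasured, as noted above):

```python
entries = 58_000_000        # roughly one entry per current Wikidata entity
avg_bytes = 10 * 1024       # assumption: ~10 KiB average serialized result
total_gib = entries * avg_bytes / 1024**3
# ~553 GiB under that assumption; the figure scales linearly with the
# real (still unmeasured) average result size.
```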

  • should data ever be evicted? Does it *have* to be evicted?

It does not *have* to be "evicted", but there will be situations where it is detected to be out of date and thus regenerated.

  • how bad is it if we lose some data unexpectedly?

Not very; everything can and will be regenerated, but that takes time.

How bad is it for all the data to become unavailable?

Unavailable or totally lost?
Unavailable for a short period of time would not be critical.
Unavailable for longer periods of time could have knock-on effects on other services, such as WDQS not being able to update fully once T201147 is complete, but I'm sure whatever update code is created would be able to handle such a situation.

Totally losing all of the data would be pretty bad; it would probably take an extreme amount of time to regenerate at a regular pace for all entities.

  • what's the read/write load?

Write load, once the job is fully deployed, would be roughly the Wikidata edit rate, but limited / controlled by the job queue rate for "constraintsRunCheck".
This can be guesstimated at 250-750 per minute max, but there will also be de-duplication for edits to the same pages to account for here.
If more exact numbers are required we can have a go at figuring that out.
Currently the job is only running on 25% of edits.
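Converting that guesstimate to per-second figures:

```python
low, high = 250, 750                   # guesstimated max jobs per minute
low_ps, high_ps = low / 60, high / 60
# i.e. roughly 4.2 to 12.5 writes/second at the ceiling, before
# de-duplication and while the job still covers only 25% of edits.
```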

Read rate can currently be seen at https://grafana.wikimedia.org/d/000000344/wikidata-quality?panelId=6&fullscreen&orgId=1
On top of this the WDQS updaters would also be needing this data once generated.
This would either be via an HTTP API request which would likely hit the storage, or this could possibly be sent in some event queue?

  • what are the requirements for cross-DC replication?

Having the data accessible from both DCs (for the DC failover case) should be a requirement.

  • what transactional consistency requirements exist?

Not any super important requirements here.
If we write to the store we would love for it to be written and readable in the next second.
Writes for a single key will not really happen too close together, probably multiple seconds between them.
Interaction between keys and order of writes being committed to the store isn't really important.

  • what's the access pattern? Is a plain K/V store sufficient, or are other kinds of indexes/queries needed?

Just a plain K/V store.

Also, so you have a specific storage technology in mind? In discussions about this, Cassandra seems to regularly pop up, but it's not in the proposal. As far as I know, there is currently no good way to access Cassandra directly from MW core (not abstraction layer, but apparently also no decent PHP driver at all, and IIRC there are also issues with network topology).

For technology we don't have any particular preferences; whatever works for the WMF, Ops, and TechCom.
Ideally something that we would be able to get access to and start working with sooner rather than later.

I was hoping for @Joe and @mobrovac to ask more specific questions, but they are both on vacation right now. Perhaps get together with them to hash out a proposal when they are back.

More than happy to try and hash this out a bit more in this ticket before passing it back to a TechCom meeting again.
It'd be great to try and make some progress here in the coming month.

Most importantly, the proposal assumes the existence of a "more permanent storage solution" which is not readily available. This would have to be created.

I guess the closest thing we have like it right now would be the parser cache system backed by MySQL.

I'd be interested in your thoughts on T227776: General ParserCache service class for large "current" page-derived data.

daniel added a subscriber: Krinkle.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

Without digging into it too deeply that sounds like exactly what we need.

Moving to under discussion. @mobrovac and @Joe said they had further questions. @Krinkle has some ideas as well.

I'm keen to answer more questions and hear ideas!

I see that T227776 has had little activity since November, is there any way to move this forward?

I see that T227776 has had little activity since November, is there any way to move this forward?

It's stalled on "nobody needs this", so saying "I need this" may help move it forward :) Talk to @kchapman and @CCicalese_WMF for prioritization.

Yes, we need this!
This decision is one of the two RFCs blocking continued work on constraint checks for Wikidata that was started in either late 2018 or early 2019. :)

Yes, we need this!
This decision is one of the 2 rfcs blocking continued work on constraint checks for Wikidata that was started in either late 2018 or early 2019. :)

Tagging Platform Engineering for consideration of a future initiative.

Task description

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

[…] The RDF page-action exists for use by the WDQS and will not run the constraint check itself, it only exposes an RDF description of the currently stored constraints that apply to this entity.

If I understand correctly then, the RDF action is not a way to access the result of, nor to trigger, a WDQS request. Is that right?

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Task description

The Query service performs checks on Wikidata entities on-demand from users. Results of these constraint checks are cached by MediaWiki (WBQC) in Memcached. […] The API only makes an internal request to WDQS if the constraint checks data is out of date, absent, or expired […].
[…] We could make the Job […] that informs the Query service to pull the API to ingest the new data.

  • […] we don't currently have the result of all constraints checks for all Wikidata items stored anywhere.
From T201147:

At the moment constraints violations are only imported to WDQS if they are cached the moment WDQS pulls the rdfs for constraint violations for an item. There is a race condition between the WDQS poller and the constraints check execution and this is why only a fraction of constraint violations are imported.

The above sounds contradictory to me, but I assume that must be because I'm misunderstanding something.

If I understand correctly, the authoritative source for describing items is Wikidata.org. The RDF Action on Wikidata.org exposes information relevant to constraint checks. The way we actually execute those constraint checks is by submitting a query to the Query service (WDQS), which has a nice relational model of all the relationships and metadata etc. The thing that executes these checks is the MediaWiki WikibaseQualityConstraints extension (WBQC), and it caches the result for a day in Memcached.

So far so good, I think. But then I also read that Query service (WDQS) ingests the result of these checks (which it executed itself?), and that we want the Job to notify WDQS when it is best to poll for that so that it is likely a Memcached cache-hit.

I don't know why the result of this is stored in WDQS. But, that sounds to me like you already have a place to store them all?

Task description

The constraint checks are accessible via 3 methods:

  • RDF action
  • Special page
  • API

[…] The RDF page-action exists for use by the WDQS and will not run the constraint check itself, it only exposes an RDF description of the currently stored constraints that apply to this entity.

If I understand correctly then, the RDF action is not a way to access the result of, nor to trigger, a WDQS request. Is that right?

No

Part of the thing being stored may have been generated using data from a query to WDQS however.

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Sorry, it does set the cache, but will not get the cache.

Task description

The Query service performs checks on Wikidata entities on-demand from users. Results of these constraint checks are cached by MediaWiki (WBQC) in Memcached. […] The API only makes an internal request to WDQS if the constraint checks data is out of date, absent, or expired […].
[…] We could make the Job […] that informs the Query service to pull the API to ingest the new data.

  • […] we don't currently have the result of all constraints checks for all Wikidata items stored anywhere.
From T201147:

At the moment constraints violations are only imported to WDQS if they are cached the moment WDQS pulls the rdfs for constraint violations for an item. There is a race condition between the WDQS poller and the constraints check execution and this is why only a fraction of constraint violations are imported.

The above sounds contradictory to me, but I assume that must be because I'm misunderstanding something.

If I understand correctly, the authoritative source for describing items is Wikidata.org.

Yes

The RDF Action on Wikidata.org exposes information relevant to constraint checks.

Yes
For example https://www.wikidata.org/wiki/Q64?action=constraintsrdf
If this page appears blank then you'll need to hit https://www.wikidata.org/wiki/Special:ConstraintReport/Q64 first to generate and cache the results

The way we actually execute those constraint checks is by submitting a query to the Query service (WDQS), which has a nice relational model of all the relationships and metadata etc.

Only some checks result in queries to the WDQS.
Other checks are done purely in PHP.

The thing that executes these checks is the MediaWiki WikibaseQualityConstraints extension (WBQC), and it caches the result for a day in Memcached.

Yes

So far so good, I think. But then I also read that Query service (WDQS) ingests the result of these checks (which it executed itself?), and that we want the Job to notify WDQS when it is best to poll for that so that it is likely a Memcached cache-hit.

I don't know why the result of this is stored in WDQS. But, that sounds to me like you already have a place to store them all?

WDQS is not really a store; for us currently it is 16 or so stores.
On top of that, it isn't a store, it is a query service, and has all kinds of baggage attached to it because of that.

Anyway, these results need to be accessible in MediaWiki PHP code.
WDQS instances should also not be seen as persistent or consistent.
Part of the requirement of this task is that we have a copy of constraints for entities so that we can dump them and reload a WDQS instance.

Task description

The special page currently always re-runs the constraint checks via WDQS, it does not get or set any cache.

Why not?

Sorry, it does set the cache, bur will not get the cache.

I don’t think it sets it either – as far as I’m aware the special page is completely oblivious to the cache. As to the “why”, it seemed useful at the time to have a way for users to get guaranteed-fresh constraint check results. (We could still set the cache in that case, of course, it’s just not implemented – and since the special page is very rarely used, it wouldn’t make much of a difference.)

It's a shame that this has made its way into the "icebox" of the Platform Engineering Roadmap Decision Making board.
We (WMDE) are seemingly not in a place to make this decision ourselves and need collaboration from TechCom and I imagine Platform Engineering on this matter.
Potentially this could make for a good candidate for the new https://www.mediawiki.org/wiki/Proposal_for_a_Technical_Decision_Making_Process

It's a shame that this has made its way into the "icebox" of the Platform Engineering Roadmap Decision Making board.

To clarify: It's unfortunately not very clear, but "icebox" doesn't mean "we definitely won't do it". It just means "not currently on our roadmap". If it's important to someone, it can be taken out of the icebox. It's basically a matter of discussing priorities and collaboration between teams.

Potentially this could make for a good candidate for the new https://www.mediawiki.org/wiki/Proposal_for_a_Technical_Decision_Making_Process

Yes, but the first and foremost question would be the one of resourcing and stewardship. The process assumes that we know upfront who'd do the work.

I still find myself very confused by this task and imagine that others might be struggling as well.

I'll try to pick up the thread from before and continue to ask clarifying questions.

  • The authoritative source for describing items is Wikidata.org.
  • The authoritative source for describing the constraint checks is also on Wikidata.org.

What is the authoritative source for executing a constraint check if all caches and secondary services were empty? I believe this is currently in MediaWiki (WBQC extension), which may consult WDQS as part of the constraint check, where WDQS in this context is mainly used as a way to query the relational data of Wikidata items, which we can't do efficiently within MediaWiki so we rely on WDQS for that. This means to run a constraint check, WDQS needs to be fairly up to date with all the items, which happens through some sync process that is not related to this RFC. Does that sound right?

Speaking of caches and secondary services, where do we currently expose or store the result of constraint checks? As I understand it, they are:

  • Saved in Memcached for 1 day after a computation happens.
  • Exposed via action=rdf, which is best-effort only. Returns cache hit or nothing. It's not clear to me when one would use this, and what higher-level requirements this needs to meet. I'll assume for now there are cases somewhere where a JS gadget can't afford to wait to generate it and is fine with results just being missing if they weren't recently computed by something unrelated.
  • Exposed via Special:ConstraintReport/Q123, which ignores the cache and always computes it fresh.
  • Exposed via API action=wbcheckconstraints, which is the main and reliable way to access this data from the outside. Considers cache and re-generates on the fly as needed, so it might be slow.

It's not clear to me why Special:ConstraintReport exists in this way. I suspect maybe it is to allow for an external service to be a cache or storage of constraint check results without having to worry about stale caches, so it's basically exposing the computation end-point directly. That seems fine. What are those external stores? I think that's WDQS right? So WDQS is used for storing relational items, but also for storing constraint data. If so, why not use that as the store for this task? (Also, how is that currently backfilled? Are we happy with that? Would a different outcome to this task result in WDQS no longer doing this this way?)

I suspect the reason we don't want to use WDQS for this is that you want to regularly clear that out and repopulate it from scratch, and ideally in a way that doesn't require running all non-memcached checks again which presumably would take a very long time. How long would that be? And how do we load these results currently into WDQS?

Having public dumps of these seems valuable indeed. Is this something that could be dumped from WDQS?

Responding to the main RFC question - using a permanent store logically owned by MediaWiki and populated progressively seems like a better direction indeed and would make sense. I'm not seriously proposing WDQS be used for this, rather I'm trying to better understand the needs through asking what it isn't serving right now.

Assuming a store will be needed, how big would the canonical data be in gigabytes? What would be the expected writes per second to it from the job queue? And the expected reads per second from the various endpoints?
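To make the sizing question concrete, here is a parametric back-of-envelope helper. Every number in the usage comment is a hypothetical placeholder, to be replaced with actual measurements (entity count, measured average size of a serialized check result):

```python
# Back-of-envelope store sizing; all inputs are hypothetical placeholders.
def estimate_store_gb(num_entities, avg_result_bytes):
    """Canonical data size in GB if every entity stores one result."""
    return num_entities * avg_result_bytes / 1e9


# Example with made-up inputs: 100 million entities at ~5 KB per
# serialized constraint check result.
# estimate_store_gb(100_000_000, 5_000) -> 500.0 (GB)
```

The same two inputs, combined with the edit rate, would also bound the expected writes per second from the post-edit job.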

Could it have a natural eviction strategy where MW takes care of replacing or removing things that are no longer needed, or would it need TTL-based eviction?
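The "natural eviction" alternative can be sketched like this: MediaWiki overwrites the stored result whenever the entity is re-checked post-edit and deletes it when the entity goes away, so at most one result per entity ever exists and no TTL is needed. All names here are illustrative, not real hooks:

```python
# Sketch of natural (replacement-based) eviction; hypothetical names.
store = {}  # stands in for the persistent store


def on_entity_checked(entity_id, revision, results):
    # The post-edit job overwrites the previous row for this entity,
    # so the store never grows beyond one result per entity.
    store[entity_id] = {"revision": revision, "results": results}


def on_entity_deleted(entity_id):
    # Deletion is the only other cleanup needed; no TTL sweep required.
    store.pop(entity_id, None)
```

Under this model the store's size is bounded by the number of entities, whereas TTL-based eviction would instead bound it by the recent check rate.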

Depending on the answers to this, using the Main Stash might work. It is essentially a persisted and replicated cache without LRU/pressure eviction, which sounds like it would fit. It is currently backed by Redis and was until recently used for sessions; it is now being migrated to a simple external MariaDB cluster (T212129).

  • The authoritative source for describing items is Wikidata.org.
  • The authoritative source for describing the constraint checks is also Wikidata.org.

What is the authoritative source for executing a constraint check if all caches and secondary services were empty? I believe this is currently MediaWiki (the WBQC extension), which may consult WDQS as part of the constraint check. WDQS in this context is mainly used as a way to query the relational data of Wikidata items, which we can't do efficiently within MediaWiki, so we rely on WDQS for that. This means that to run a constraint check, WDQS needs to be fairly up to date with all the items, which happens through a sync process that is unrelated to this RFC. Does that sound right?

Yes, though I would add that WDQS is only used for certain kinds of constraint checks – mainly those that have to traverse the “subclass of” hierarchy. Constraint checks that only use the data of one or two items (this item should also have property X; the value of property Y should be an item that has property Z) don’t use the query service.

Or put differently – when checking constraints, we use the query service for data of some other items, but not for data of the item itself; that’s why it’s okay if the query service has slightly stale data when we check constraints of an item immediately after it has been edited. (I realized while writing this that that’s not entirely true, but hopefully T269859 won’t be too hard to fix.)
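To illustrate the split described above: a hierarchy-traversing check needs a SPARQL property-path query over “subclass of” (P279), which is what WDQS provides, while a single-item check only needs the item's own statements. The query below is illustrative (Q42/Q5 are example items), and has_property is a hypothetical stand-in for the extension's logic:

```python
# Illustrative SPARQL for the kind of check that needs WDQS: does the
# class of wd:Q42 fall under wd:Q5 via the "subclass of" (P279)
# hierarchy? Property-path traversal like this is why MediaWiki
# delegates these checks to the query service.
SUBCLASS_QUERY = """
ASK {
  wd:Q42 wdt:P31/wdt:P279* wd:Q5 .
}
"""


# By contrast, a check like "this item should also have property X"
# only needs the item's own statements, so no SPARQL round trip occurs.
def has_property(item_statements, property_id):
    return property_id in item_statements
```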

Speaking of caches and secondary services, where do we currently expose or store the result of constraint checks? As I understand it, they are:

  • Saved in Memcached for 1 day after a computation happens.
  • Exposed via action=rdf, which is best-effort only. Returns a cache hit or nothing. It's not clear to me when one would use this, and what higher-level requirements this needs to meet. I'll assume for now there are cases somewhere where a JS gadget can't afford to wait for the results to be generated and is fine with them simply being missing if they weren't recently computed by something unrelated.

It’s mainly used by the query service updater, which adds constraint check results to the query service, best-effort as you say.

  • Exposed via Special:ConstraintReport/Q123, which ignores the cache and always computes it fresh.
  • Exposed via API action=wbcheckconstraints, which is the main and reliable way to access this data from the outside. It considers the cache and re-generates on the fly as needed, so it might be slow.

It's not clear to me why Special:ConstraintReport exists in this way.

Historically, the special page predates the API (and also any form of caching); the main reason it still exists this way is just that we haven’t removed it yet. You can see on Grafana that it’s barely used, a dozen requests a day or so.

I’ll skip quoting the rest of the comment, but regarding using WDQS as the store itself, I guess it’s important that WDQS isn’t really one store – there are (I believe) about a dozen instances, distributed across eqiad and codfw, and they’re fairly independent of one another. They’re supposed to all contain the same data, since they update themselves from the same data source, but over time the number of triples still drifts apart slightly. So to use WDQS as the store, we would have to push the data to each instance. The store which this RFC asks for, I believe, should instead store the data only once, and then each of the query service updaters (each instance has its own updater) can pull the data from there, and so can action=wbcheckconstraints.

If it would help I'm more than happy to talk this through from the product side in a call and I'm sure Lucas and Adam would be as well.

It looks like we would be able to use the ParserCache for this once T270710: Allow values other than ParserOutput to be stored in a ParserCache instance is done?

So I guess the scope of this RFC would then switch to: is it okay to use the parser cache for this?

@Ottomata Could this also be a good candidate for storage in a persistent Kafka stream in production?
Though we would likely also want to be able to back such a stream with SQL for third-party users of this extension.

It looks like we would be able to use the ParserCache for this once T270710: Allow values other than ParserOutput to be stored in a ParserCache instance is done?

Yes, but that's not on any roadmap currently. If you want it prioritized, please reach out to @Naike so she can slot it into our process.

So I guess the scope of this RFC would then switch to: is it okay to use the parser cache for this?

Conceptually I think this is the correct thing to do. But there are operational considerations – if this adds a considerable amount of data to the storage that backs the parser cache, this needs to be planned for. Scaling the parser cache backend storage is currently a very manual and slow process.

@Ottomata Could this also be a good candidate for storage in a persistent Kafka stream in production?

I'd be curious to see that... my understanding is that the processing may well happen in Kafka, but you may have to materialize the data in some other storage system for queries/lookups. But I'd be happy to learn that I'm wrong about that!

@Lucas_Werkmeister_WMDE would you be able to come up with some sort of estimate of the disk space this would likely end up using?

@Addshore I've quickly read the task description but I have to admit I don't fully understand it yet (what gets stored where, etc.). Find me on IRC? Or set up a quick little meeting? :)

@WMDE-leszek will be tackling the topic of this RFC again shortly.