
Store WikibaseQualityConstraint check data in persistent storage instead of in the cache
Open, Stalled, Medium, Public

Description

Currently the WikibaseQualityConstraints extension stores data in a memcached cache, but there is no guarantee of how long that data will remain there, as it can be evicted.
In order to make the data more persistent we want to store it in its own SQL DB table.

The schema could probably be something like ( entityId, timestampLastUpdated, blob ).
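
A minimal sketch of what such a table could look like, using SQLite in Python purely for illustration (the production table would live in MariaDB, and all names here are placeholders rather than a final schema):

```python
import sqlite3

# In-memory SQLite stand-in for the proposed table. The real table would live
# in the wikidatawiki database cluster; names and types here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE wbqc_constraint_check (
        entity_id              TEXT PRIMARY KEY,  -- e.g. 'Q42'
        timestamp_last_updated TEXT NOT NULL,     -- MediaWiki-style timestamp, e.g. '20181101142800'
        blob                   BLOB NOT NULL      -- serialized (JSON) constraint check result
    )
    """
)
# An index on the timestamp supports "find the oldest results" maintenance queries.
conn.execute(
    "CREATE INDEX wbqc_cc_timestamp ON wbqc_constraint_check (timestamp_last_updated)"
)
```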

TBA: How much data will this currently add to the wikidatawiki DB?
TBA: Can we use Cassandra for the actual blob storage, just keeping the index of entities and dates in SQL?

What will this allow

  • The query service wants to be able to find constraint violations (T192565), so constraint checks will have to be run on all entities (T201150) in order to have a complete set of data. There is no point in running the constraint check multiple times (if it drops out of the cache), hence the DB table.
  • This will allow dumps of all entity constraint checks, which will enable easy reloading of data into a query service server without hitting the API a bunch.
  • The 'timestampLastUpdated' field will allow Wikibase to find the oldest constraint check data and re-run the checks for very old data points (a query sketch follows below).
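
For illustration, re-running checks for the oldest data could then be driven by a query like the following (again only a sketch against the hypothetical SQLite table above; in production this would be a maintenance job/query in MediaWiki):

```python
def oldest_entities(conn, limit=100):
    """Return the IDs of the entities whose stored constraint check results
    are oldest, so a job can re-run the checks for them (hypothetical helper
    against the sketch table above)."""
    rows = conn.execute(
        "SELECT entity_id FROM wbqc_constraint_check "
        "ORDER BY timestamp_last_updated ASC LIMIT ?",
        (limit,),
    )
    return [entity_id for (entity_id,) in rows]
```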

Acceptance Criteria

Related Objects

Event Timeline

Addshore renamed this task from Store WikibaseQualityConstraint check data in an SQl table instead of in the cache to Store WikibaseQualityConstraint check data in an SQL table instead of in the cache. Sep 11 2018, 7:59 AM

Some previous discussion of this, including suggested database schemas, can be found in the comments of T179849: Cache all constraint check results per-entity.

I have added this use case to the list of possible use cases for Cassandra, as storing the blobs in Cassandra instead of in SQL would make sense.

+1 for Cassandra, but one disadvantage might be that normal installations don't have the service.

mobrovac subscribed.

Could you describe the exact data that would be stored and its data model? I looked at T179849, but I see two different things there: in T179849#3770891 a resulting JSON blob is mentioned, but T179849#3778358 outlines a full SQL schema (with various indices) and I cannot relate the two.

If we are talking about a key/value JSON/blob storage, the Cassandra-powered key/value storage component might be a good fit for it, but at this point we need more info:

  • the aforementioned data model
  • access pattern: is it write-once and read multiple times, do overwrites happen, etc.?

+1 for Cassandra, but one disadvantage might be that normal installations don't have the service.

Ideally, once we have the storage component in place in our infrastructure, alternative MW interfaces ought to be developed that allow smaller-scale installations to use something else, which should resolve this concern.

This use case seems similar to caching parsoid HTML, which is done in RESTbase and backed by Cassandra. It's similar, because it's re-generated upon edit, and accessed from clients upon view, via an API. It's also similar in that losing this data is not absolutely critical, as it can be regenerated, but having to re-generate all of it may cause a problematic spike in load on application servers (and databases and the query service).

However, in contrast to the parsoid use case, information does not need to be stored for old revisions.

As to the model: the wikidata folks will have the details, but as far as I'm aware, it's a JSON blob for each Wikidata entity (items, properties, etc). Granularity could be increased to per-statement blobs.

Purging is, as far as I know, currently only done per edit of the subject. However, use cases for bulk purges exist (in particular, when constraint definitions change), but they are just ignored at the moment, as far as I know. I could be wrong about that, though.

If I understand the above correctly, we're saying that this is strictly key/value, where the key is an entity ID and the value an opaque JSON blob. When the subject is edited, the value is overwritten with the most recent constraint check. And when the format of constraint definitions changes, we need to be able to bulk purge previous entries in the obsolete format. Is this correct?
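
A rough sketch of that model as described above; the class and method names are made up, and keying the bulk purge on a stored format version is just one possible way to support it:

```python
class ConstraintResultStore:
    """Hypothetical key/value interface for constraint check results;
    none of these names exist in WikibaseQualityConstraints."""

    def __init__(self):
        self._store = {}  # entity ID -> (format_version, opaque JSON blob)

    def put(self, entity_id, json_blob, format_version):
        # Overwritten with the most recent check whenever the subject is edited.
        self._store[entity_id] = (format_version, json_blob)

    def get(self, entity_id):
        entry = self._store.get(entity_id)
        return entry[1] if entry is not None else None

    def purge_format(self, obsolete_version):
        # Bulk purge of entries stored in an obsolete format, e.g. after the
        # format of constraint definitions / results changes.
        for entity_id, (version, _) in list(self._store.items()):
            if version == obsolete_version:
                del self._store[entity_id]
```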

Some additional questions...

An opaque k/v store won't allow anything but discrete lookups by entity ID, so how are violations queried? In other words, this seems to only be a small part of the larger model; what does that look like, and why are we creating this separation (i.e. what problem does this solve)?

Numbers regarding total number of entities, and the size of the values will be important of course, but perhaps most important will be some idea about access patterns. How frequently will entities be (over)written? How often read? I realize the answer to this is probably a distribution, and that this may involve some educated guess work.

What happens if constraint definitions change? Are we able to wholesale drop the older ones? Is the constraint check inlined on a miss, and is the latency (and additional load) under such circumstances acceptable? Or will some sort of transition be needed where we fall back to the older check when that's available, and replace them gradually?

I'll probably have more questions.

An opaque k/v store won't allow anything but discrete lookups by entity ID, so how are violations queried? In other words, this seems to only be a small part of the larger model; what does that look like, and why are we creating this separation (i.e. what problem does this solve)?

These constraint violations will be loaded into the Wikidata Query Service and will be queryable via SPARQL there.
The constraint results are also needed in MediaWiki to expose to the user in the UI and APIs.
There is the possibility that we will need to provide dumps of all constraint violations in order to ease the loading of data into WDQS servers that are starting from scratch, but @Smalyshev would have to chime in there.

Numbers regarding total number of entities, and the size of the values will be important of course, but perhaps most important will be some idea about access patterns. How frequently will entities be (over)written? How often read? I realize the answer to this is probably a distribution, and that this may involve some educated guess work.

Some numbers I can put here right now are:

  • Wikidata has 50 million entities, each of which can have a violations report. Assuming we keep all of the violations for a single entity in a single blob, which would be the current plan, that would mean 50 million things to be stored.
    • If we wanted to store constraint data for statements individually, as @daniel suggested above, each of the current 544 million statements would store its own blob. Each blob would of course be smaller than the blob for the whole entity, but the number of entries is much larger.
  • There are currently around 20,000 edits per hour on wikidata.org, so the number of writes to the storage should be that or less. The plan will be to have the regular running of these constraints be performed by jobs (T204031), in which case there will likely be some deduplication if an entity is edited multiple times in a short period.
  • Reads: I'll have to have a more careful look
  • Size: I'll have to look at this too (a rough, purely illustrative sizing sketch with an assumed blob size follows below)
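
Purely as a back-of-the-envelope illustration of the scale (the entity and edit counts are the ones above; the average blob size is an assumption, not a measurement):

```python
# Hypothetical sizing sketch; only the entity/edit counts come from the comment
# above, the average blob size is an assumed placeholder value.
ENTITIES = 50_000_000        # Wikidata entities (from above)
EDITS_PER_HOUR = 20_000      # upper bound on the write rate (from above)
ASSUMED_BLOB_KB = 10         # assumption only; real sizes not yet measured

total_gb = ENTITIES * ASSUMED_BLOB_KB / 1024 / 1024
writes_per_second = EDITS_PER_HOUR / 3600

print(f"~{total_gb:.0f} GB in total at {ASSUMED_BLOB_KB} KB per entity blob")
print(f"<= ~{writes_per_second:.1f} writes/s before job-queue deduplication")
```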

What happens if constraint definitions change? Are we able to wholesale drop the older ones? Is the constraint check inlined on a miss, and is the latency (and additional load) under such circumstances acceptable? Or will some sort of transition be needed where we fall back to the older check when that's available, and replace them gradually?

As mentioned above, the constraint reports will be created post edit. The plan to use the job queue means that, in theory, if we purged all of the data it would slowly rebuild itself, but the constraint checks can take some time, hence the request for more persistent storage than we are currently getting with memcached.

I'll probably have more questions.

I look forward to them

There is the possibility that we will need to provide dumps of all constraint violations in order to ease the loading of data into WDQS servers that are starting from scratch

If we plan to have permanent storage for it (unlike the current load-on-new-edit model, which ignores entities that haven't been edited recently) then yes, we'd need to either put violations into the main dump or create a separate dump for violations only.

I am afraid getting accurate numbers about size is hard.

Storing the RDF should be no problem:
https://www.wikidata.org/wiki/Q42?action=constraintsrdf

If we want to store the JSON to be able to completely remove memcache we should be able to store something like this:
https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=Q42

We should also mention that the count and size of the violation reports are not really predictable, because they are derived from applying the constraint definitions to items (both are user-curated).

We primarily need a key-value store, yes: given an entity ID, get the constraints data for that entity.

However, that data also includes some metadata on the validity of the constraints data, specifically a set of other entity IDs with their latest revision IDs at the time of the constraint check: if any of those entities have been edited since then (i. e., the latest revision ID is no longer the one stored in the metadata), the cached/stored results are stale and should not be used. This covers the case of constraint definitions changing (constraint definitions are statements on properties, so any change to the constraint definitions means a new revision on a property page), but also other changes (e. g. if there is a constraint “the value entity of this statement must have a statement X”, then the results are stale if the entity in the statement value was edited).
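
A sketch of that staleness check as described (not the actual extension code; the data layout and function names are assumptions):

```python
def is_stale(stored_result, get_latest_revision_id):
    """Hypothetical staleness check. stored_result["depends_on"] is assumed to
    map entity IDs to the revision IDs they had when the constraint check ran;
    get_latest_revision_id(entity_id) is assumed to return the current latest
    revision ID of that entity."""
    for entity_id, revision_at_check in stored_result["depends_on"].items():
        if get_latest_revision_id(entity_id) != revision_at_check:
            return True  # a dependency was edited since the check ran: stale
    return False
```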

Currently, these metadata checks are done when constraint check results for an entity are requested, but ideally there would also be a way to mass-refresh all related constraint check results when an entity is updated. This might require storing that metadata in a different place.

This ticket is fantastic news.

It's probably completely out of the scope of this, but let's mention it anyway: I think it would be massively useful for revision scoring if we could have access to both the current violation report and the one for the previous revision. So that we could compute the new violations each edit introduced or solved. (Pinging @Tpt about that).

But I suspect storing the violations reports for all revisions of a given item might require too much storage? (For vandalism detection purposes I guess only the last few revisions should be needed, but you probably don't want to introduce an arbitrary cutoff…)

@Pintoch: Sounds good. Could you explain a bit more how you'd use the data? And what data specifically so we can see if we can make it happen?

@Lydia_Pintscher personally here is what I would concretely implement in the EditGroups tool. For each edit that is part of an edit group:

  • fetch the constraint violations before and after the edit (this fetching would happen as the edit is retrieved, so in near real-time)
  • compute the difference of constraint violations of each type (for instance, 1 new "value type constraint" violation and 2 fewer "statement required constraint" violations); see the sketch after this list
  • aggregate these statistics at a batch level and expose them in batch views (for instance, this batch added 342 new "value type constraint" violations and solved 764 "statement required constraint" violations)
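
A sketch of the per-edit diffing step from the second bullet, assuming the check results can be obtained as lists of violations tagged with their constraint type (all names here are hypothetical):

```python
from collections import Counter

def violation_diff(before, after):
    """Per-constraint-type change in the number of violations introduced or
    solved by an edit. 'before' and 'after' are assumed to be lists of
    violation dicts carrying a 'constraint_type' key (hypothetical format)."""
    before_counts = Counter(v["constraint_type"] for v in before)
    after_counts = Counter(v["constraint_type"] for v in after)
    types = set(before_counts) | set(after_counts)
    # Positive values mean new violations, negative values mean solved ones.
    return {t: after_counts[t] - before_counts[t] for t in types}
```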

Together with the number of reverted edits in a batch (which the tool already aggregates), this could potentially make it easier to spot problematic batches.

Other ideas of applications (that I would not write myself):

  • I am not involved in ORES development but I believe the statistics computed above (at edit level) could be useful for vandalism detection - if the constraint violations are already computed and cheap to retrieve, it might be much easier for them to rely on that.
  • I believe @Tpt could be interested in this as he has been working on detecting edits which introduce / solve constraint violations.
  • Maybe the Wikibase UI could show violations when viewing previous revisions of an item?
  • Or even show new/old constraints in Diffs? That would be epic.

Thanks a lot! That helps me understand it better and it definitely sounds awesome.
The problem is that the current item is not the only thing that can be responsible for a violation: causing or fixing a violation on item X is not necessarily related to an edit on item X. Have you thought about this?

@Lydia_Pintscher yes indeed! For instance the aggregation at batch-level would probably not be meaningful for inverse constraints (unless there is a way to detect all the violations added and solved by an edit, not just on the item where the edit was made). But isn't this a problem that you have anyway, even when storing only the latest violations? For instance, if I add a "subclass of (P279)" statement between two items, don't you need to recompute type violations for all items which are instances of some transitive subclass of the new subclass? I am not sure how this invalidation is done at the moment.

Maybe it is too hard to store the entire violations reports for previous revisions, but it might be feasible for you to expose some statistics about the changes in the violations for each edit? If you have some clever invalidation logic to deal with changes of violations on other items, and if the invalidated violations are recomputed immediately after, it might even be doable to include these in the statistics?

That all sounds pretty hard to me and I don't want to hijack your thread!

Thank you @Pintoch for raising this idea. For my project on fixing constraint violations, I can mine violations from the history offline, so I do not really need this feature.

About the vandalism detection (and maybe diffs) use case, what we could do, and what seems quite reasonable to me, is to consider the violations related to revision R of item I to be the violations of the statements of I with respect to the state of Wikidata just after R has been saved. I.e. we would consider as the violations for R what the current WikidataQuality extension would output for I if it had been executed just after R was saved.

The advantage of this method is that it should not be too computationally expensive (we would just have to run and save the current constraint violation checks after each edit), but it may lead to bad artifacts. For example, if someone vandalizes the item C1 by removing "C1 subClassOf C2", and then someone edits an item that has type C1 by adding a property that has the constraint "type C2", this edit will be considered as introducing a type violation even though it should not. But we can hope that structural items like C1 are fairly carefully checked, so this problem will not affect too many edits.

I was thinking of the opposite: consider the violations related to the revision R of the item I to be the violations of the statements of I with respect to the state of Wikidata just before R+1 was saved.

Because for the current revision, you do want to keep invalidating the violations when other edits impact them - you don't want to display the violations as they were when the latest edit was done.

But that does not solve the issue: the scenario you describe can still happen in the opposite direction. But I agree it should not happen that often.

Addshore triaged this task as Medium priority. Oct 8 2018, 10:56 AM
Addshore renamed this task from Store WikibaseQualityConstraint check data in an SQL table instead of in the cache to Store WikibaseQualityConstraint check data in persistent storage instead of in the cache. Nov 1 2018, 2:28 PM

@mobrovac might it make sense for this to go through TechCom or some other decision process / rfc?

Essentially, we are talking here about dependency tracking for Wikibase/Wikidata. We do need a larger solution to the dependency tracking problem all over our infrastructure, with WB/WD being the pathological case. At some point we will start seriously tackling that problem, but until we do, it makes perfect sense to me to have the constraints stored permanently.

I see certain questions that need to be resolved before any movement can be made on this front, such as:

  • will we store data only for the latest revision or not (implies different storage semantics and guarantees)
  • how will we detect key/value pairs that need to be invalidated (possibly efficiently)
  • do we benefit from having a multi-DC storage solution for this

There are probably more detailed questions that lurk there too. Given all of the above, yes, I think it would be good to have an RfC written up that explores these questions.

I think for the time being we’ll only store data for the latest revision. Storing constraint check results for older revisions could be useful from time to time (e. g. for T165722: Integrate constraint reports into recent changes and watchlist), but it’s not a requirement for now. (Even if we stored some results for older revisions, I think the most useful ones would be revisions of the last 30 days, not so much beyond that.)

will we store data only for the latest revision or not (implies different storage semantics and guarantees)

I agree with everything @Lucas_Werkmeister_WMDE said here, only store data for the latest revision.

how will we detect key/value pairs that need to be invalidated (possibly efficiently)

This invalidation logic is already contained within WikibaseQualityConstraints and currently used with memcached, checking if the stored value is still valid or rechecking and regenerating it.

do we benefit from having a multi-DC storage solution for this

As in, mirrored between the DCs and retrievable in both, rather than either regenerating the data in the case of a DC switchover or having services operating in the passive DC retrieve it from the active one?

I'll just put some points here:

  • We don't want to have to regenerate this data if we switch DCs
  • WDQS will be calling the MW api to essentially retrieve this data after it is generated, and WDQS is in both DCs so shorter distance to the data would be better.

There are probably more detailed questions that lurk there too. Given all of the above, yes, I think it would be good to have an RfC written up that explores these questions.

Should I write a TechCom-RFC for this in general to get the ball moving?

will we store data only for the latest revision or not (implies different storage semantics and guarantees)

I agree with everything @Lucas_Werkmeister_WMDE said here, only store data for the latest revision.

Ok, that makes things much simpler.

how will we detect key/value pairs that need to be invalidated (possibly efficiently)

This invalidation logic is already contained within WikibaseQualityConstraints and currently used with memcached, checking if the stored value is still valid or rechecking and regenerating it.

That's a bit out of scope for the data placing strategy itself, but I think this point needs more discussion.

do we benefit from having a multi-DC storage solution for this

As in, mirrored between the DCs and retrievable in both, rather than either regenerating the data in the case of a DC switchover or having services operating in the passive DC retrieve it from the active one?

I'll just put some points here:

  • We don't want to have to regenerate this data if we switch DCs
  • WDQS will be calling the MW api to essentially retrieve this data after it is generated, and WDQS is in both DCs so shorter distance to the data would be better.

Ok, so that's a definitive yes.

There are probably more detailed questions that lurk there too. Given all of the above, yes, I think it would be good to have an RfC written up that explores these questions.

Should I write a TechCom-RFC for this in general to get the ball moving?

That would be a good starting point. Please put some background into the task, as well as the proposed data model and interactions of WD with the cache.

Addshore changed the task status from Open to Stalled. Jun 27 2019, 12:24 AM

Stalled on the RFC

LSobanski subscribed.

Untagging SRE, please re-add once the task is unstalled.