
Cache format constraint check results
Closed, ResolvedPublic

Description

Results of the “format” constraint, which we check on the Wikidata Query Service, are valid indefinitely, and can easily be cached.

NOTE: This task used to be a general “cache constraint check results” task, but as the discussion in it quickly focused on just the “format” result caching, I’ve re-purposed it. The general task is now T179839.

Event Timeline

Some thoughts/options for caching:

  • web caches are easy to set up, but hard or impossible to purge
  • the result of a SPARQL based regexp evaluation could be cached forever in the WDQS web cache
  • the result of a SPARQL based instanceof check can be cached "for a while" in the WDQS web cache
  • using a web cache for the output of the wbcheckconstraints API module would make it quite hard to purge the result
  • any cached constraint results for a given statement must be purged/discarded when the statement is edited (or when anything on the respective entity is edited)
  • We could use the object cache (BagOStuff and friends) to cache results for an entire entity.
    • The cache key could include the revision ID, which would enable automatic purging on every edit.
  • We could use the object cache (BagOStuff and friends) to cache results for individual statements
    • the cache key for a per-statement result could contain the statement hash, which would enable automatic purging when the statement changes (see the sketch after this list)
  • The object cache timeout should be low enough to allow changes to constraint declarations to take effect not too long after they are made (hours? how many?)
  • Any constraints that use only local info can be cached until the entity (or statement) changes.
  • Constraints using global data should only be cached for a relatively short time. Unfortunately, these are also the ones that are expensive to evaluate.
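To make the two object-cache options above concrete, here is a rough sketch (illustrative only — the key parts, values, and helper are assumptions, not existing extension code) of how the revision ID or statement hash could be baked into the cache key so that edits invalidate results by themselves:

    // Assumed context: a WANObjectCache, e.g. the main one from MediaWikiServices.
    $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();

    // Per-entity caching: the revision ID is part of the key, so every edit
    // automatically “purges” the old results by moving to a new key.
    $entityKey = $cache->makeKey(
        'WikibaseQualityConstraints',
        'entity-results',
        'Q42',          // entity ID (illustrative)
        123456789       // latest revision ID of that entity (illustrative)
    );

    // Per-statement caching: the statement hash changes whenever the statement
    // changes, so stale per-statement results are never read again.
    $statementKey = $cache->makeKey(
        'WikibaseQualityConstraints',
        'statement-results',
        'deadbeef1234'  // statement hash (illustrative)
    );

    // A short-ish TTL still bounds how long edits to the constraint
    // *definitions* themselves take to become visible.
    $results = $cache->getWithSetCallback(
        $entityKey,
        WANObjectCache::TTL_HOUR,
        function () {
            return checkConstraintsForEntity( 'Q42' ); // hypothetical helper
        }
    );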

Do you have some links to the docs perhaps?

Perhaps we want to prioritize caching compliant constraint check results over violations? I would think those are more likely to stay in place.

Change 373918 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Cache regex check results

https://gerrit.wikimedia.org/r/373918

Lucas_Werkmeister_WMDE moved this task from incoming to in progress on the Wikidata board.

I’ve started with the simplest thing we can cache: regex check results, which are valid forever.

So change Iaaac950b48 implements caching on the most granular level we have, for the “format constraint” check: does arbitrary text match arbitrary regex? This is a very common constraint type – practically all “external identifier” properties have one or more regexes to ensure that the identifier matches a certain format, many “Commons link” properties have regexes to assert that the link has a particular file name extension, and several “URL” properties check the protocol with a regex (e. g. only https?://.+). It is also, according to our statsd tracking, one of the most expensive constraint types. And as long as the regex flavor is stable, the result is valid indefinitely: it depends only on the text and the regex.
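For illustration, a minimal sketch (mine; the extension’s actual query construction and escaping are more involved) of checking a text against a regex as a SPARQL expression on the query service:

    $text = 'https://example.org/';  // value being checked
    $regex = 'https?://.+';          // format constraint regex from the property
    // The regex is anchored because format constraints must match the full value;
    // real code must escape $text and $regex before embedding them in the query.
    $query = 'SELECT (REGEX("' . $text . '", "^(?:' . $regex . ')$") AS ?matches) {}';
    // $query would then be sent to https://query.wikidata.org/sparql and the
    // boolean ?matches binding read from the JSON result.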

The question is whether it makes sense to cache this information, given the scale of Wikimedia’s object cache. I wrote a small script (P5921) to check how many format constraints each property has (it can have several), and how many statements, which should tell us how many combinations of text and regex there are to be cached (approximately). The script fails for one property (too many statements to count in WDQS), so I got the count for that property via another tool (see commit message for details). Overall, I estimate there are about one hundred million text-regex combinations for which we could cache the boolean result.

I assume that most of these combinations will never be hit (currently, only 193 users have the checkConstraints gadget enabled, and many items with “external identifier” statements will probably never be visited by any user with the gadget enabled), but still, that’s a huge number, and I have no idea if it’s sensible to cache that or not. Can someone from Performance-Team comment on this?

Pinging @aaron and @Krinkle as they are the experts on this.

Aaron suggested hashing, which is an excellent idea. Moving back to Doing.

Pinging @aaron and @Krinkle as they are the experts on this.

I left a few minor points at https://gerrit.wikimedia.org/r/373918.

I don't know much about the use of regular expressions in the particular context of Sparql and Wikidata, but it does leave me somewhat confused. If the regex itself was retrieved from the wiki database (not Sparql) and the input text is available in PHP, why do we need to query an external service to execute a regular expression?

Because checking arbitrary regexes on arbitrary input opens us up to DoS attacks via malicious regexes (e. g. catastrophic backtracking – runtime of checking (x+x+)+y on xxx..xxx is quadratic in input length). This came up during the security review of the extension (T101467: Ex: WikibaseQualityConstraints - remove or sanitize regex for FormatChecker), and we eventually decided to check regexes on an external service instead, where queries are subject to a timeout (T102752: [RFC] Workaround for checking the format constraint).
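For illustration, here is a small stand-alone timing sketch (not part of the extension; the input lengths and the raised backtrack limit are arbitrary choices) showing what such a pattern does to an in-process preg_match:

    // The pattern matches only strings ending in 'y', so a subject of only 'x'
    // characters forces the engine to try a rapidly growing number of ways to
    // split the x's between the two groups before giving up.
    ini_set( 'pcre.backtrack_limit', '100000000' ); // raised so the slowdown is visible
    foreach ( [ 16, 20, 24, 28 ] as $n ) {
        $subject = str_repeat( 'x', $n );
        $start = microtime( true );
        $matched = preg_match( '/^(x+x+)+y$/', $subject );
        // the largest input may abort with PREG_BACKTRACK_LIMIT_ERROR instead of finishing
        printf(
            "n=%2d matched=%s pcreError=%d time=%.3fs\n",
            $n,
            var_export( $matched, true ),
            preg_last_error(),
            microtime( true ) - $start
        );
    }
    // Running an attacker-supplied pattern like this inside a web request would
    // tie up the PHP worker, which is why the check is delegated to WDQS with a timeout.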

[..] why do we need to query an external service to execute a regular expression?

Because checking arbitrary regexes on arbitrary input opens us up to DoS attacks via malicious regexes [..]. This came up during the security review of the extension (T101467), and we eventually decided to check regexes on an external service instead, where queries are subject to a timeout (T102752).

Interesting. And this uses the same Sparql endpoint at https://query.wikidata.org/ as for public queries? I knew this was called from within Wikidata for MediaWiki API queries, but I didn't know it was used e.g. when saving edits (assuming validation happens there).

I'm bringing this up because the reason we want caching is the performance of these queries, right? ConstraintCheck/SparqlHelper in Wikidata currently has a timeout of 5 seconds, presumably in order to support some of the more complex regular expressions and larger input texts. I imagine that, with a more performant backend, caching might not be as needed.

(off-topic: )

Response times look fairly good in Graphite (source). The minutely p99 over the last 5 days ranges from 100-500ms, although with regular spikes of 1-5 seconds. From manual testing I find it's usually ~350ms with short regex/text, and ~600-1200ms for larger input. The only responses under 300ms are those that were a cache hit in Varnish (yay), in which case it's typically 100-130ms. This is okay.

On the other hand, considering it's for internal execution of a single regex, ~300ms is a lot. I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox? (below via eval.php):

$sandbox = new LuaSandbox;
$sandbox->setCPULimit( 0.25 ); // seconds of CPU time
$sandbox->setMemoryLimit( 512000 ); // 512KB
$lua = "function match_text(text, pattern) return string.match(text, pattern) ~= nil end";
$sandbox->loadString( $lua )->call(); // define match_text()
$result = $sandbox->callFunction( 'match_text', 'foo', 'f..' );
return is_array( $result ) && isset( $result[0] ) && $result[0] === true;

These lines take < 1ms. With large input (e.g. wikitext article), and a more complex pattern, between 5 and 10ms. However, I quickly realised Lua doesn't implement PCRE regular expressions by default (T52454, Lua Patterns). I wonder if it's worth revisiting the security implications of adding PCRE bindings to a LuaSandbox (perhaps reduced to a subclass for Wikibase).

Anyway, just a few off-topic thoughts :)

@Krinkle thanks for your comments!

And this uses the same Sparql endpoint at https://query.wikidata.org/ as for public queries?

Yup.

but I didn't know it was used e.g. when saving edits (assuming validation happens there).

No, constraint checks are currently a fairly “external” feature – they’re done when users visit an item, or after they’ve saved a statement, and only for users who have the checkConstraints gadget enabled. There are plans to enable the gadget by default (T173695), and to check constraints before saving, too (T168626), but we don’t plan to make that mandatory.

ConstraintCheck/SparqlHelper in Wikidata currently has a timeout of 5 seconds. Presumably in order to support some of the more complex regular expressions and larger input texts.

That’s not really the main reason – we also use the query service for type checking: is some item an instance of some class or a subclass of it? We try to do this check in PHP at first, but if that takes too long (e. g. because the type hierarchy is too deep, too branched, or even circular), we bail out and ask the query service instead. For just the regexes, I’d be fine with a much shorter timeout.
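As a rough illustration (an assumed query shape, not necessarily exactly what SparqlHelper builds), the type check delegated to the query service boils down to a property-path query:

    // Hypothetical sketch (P31 = instance of, P279 = subclass of).
    $itemId = 'Q42';   // item whose type is being checked
    $classId = 'Q5';   // class required by the “type” constraint
    $query = "ASK { wd:$itemId wdt:P31/wdt:P279* wd:$classId . }";
    // The query service follows the whole subclass hierarchy, even if it is very
    // deep, heavily branched, or circular – exactly the cases where the PHP-side
    // traversal has to give up.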

Response times look fairly good in Graphite (source).

Wow, I didn’t know about that thing. Nice! (Looks like I can configure the graph by tweaking the URL even though visiting https://graphite.wikimedia.org/ gets me an authentication error :) )

The minutely p99 over the last 5 days ranges from 100-500ms… This is okay.

By the way, what’s the round-trip time for memcached assuming a cache hit? If it’s, dunno, 100 ms, then the patch is probably not worth the trouble :)

(And @Jonas, perhaps we can switch the average times in the Grafana board from mean to median? Those graphite stats look a lot less serious than the Grafana board.)

I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses).

Hehe, I’ve actually written a service like that :) we have a test system for constraints on Cloud VPS (wikidata-constraints.wmflabs.org), and it doesn’t have enough RAM to run Blazegraph, so I wrote minisparql, which just listens for regex SPARQL queries and checks them with preg_match. Of course, if we were to do something like this on Wikidata itself, we could skip the SPARQL wrapping. But I have no idea how much trouble it is to deploy a new service like that…

Or perhaps using LuaSandbox?

Yeah, the different pattern flavors are a problem… the situation is already unfortunate because WDQS doesn’t quite implement PCRE either, but at least it’s similar. Lua Patterns seem to be completely different. (We’re stuck with PCRE-like patterns because there’s also an external service, KrBot, which checks constraints daily and saves the results in WD:Database Reports/Constraint Violations, and we need to be (mostly) compatible with it.)

If we want to avoid flooding the cache with rarely used long-tail combinations, maybe something like this could be done:

    $textHash = hash( 'sha256', $text );
    $cacheMap = $this->cache->getWithSetCallback(
        $this->cache->makeKey(
            'WikibaseQualityConstraints', // extension
            'regex', // action
            'WDQS-Java', // regex flavor
            hash( 'sha256', $regex )
        ),
        WANObjectCache::TTL_DAY,
        function ( $curCacheMap ) use ( $text, $regex, $textHash ) {
            // Initialize the cache map if not set
            if ( $curCacheMap === false ) {
                return [];
            }
            // Refresh triggered by hotTTR...
            // Move the regex-text check result to the front of the LRU map if already present
            if ( isset( $curCacheMap[$textHash] ) ) {
                return [ $textHash => $curCacheMap[$textHash] ] + $curCacheMap;
            }
            // Get the regex-text check result
            $time = microtime( true );
            $value = (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
            if ( ( microtime( true ) - $time ) < .010 ) {
                return $curCacheMap; // cheap check, don't bother caching
            }
            // Insert the regex-text check result into the bottom 3/8 of the LRU map
            $index = intval( count( $curCacheMap ) * 5 / 8 );
            $newCacheMap = [];
            $pos = 0;
            foreach ( $curCacheMap as $k => $v ) {
                if ( $pos == $index ) {
                    $newCacheMap[$textHash] = $value; // inject
                    ++$pos;
                }
                if ( $pos++ >= 100 ) {
                    break; // prune to size
                }
                $newCacheMap[$k] = $v; // preserve
            }
            if ( !isset( $newCacheMap[$textHash] ) ) {
                $newCacheMap[$textHash] = $value; // map was empty, so $index was never reached
            }

            return $newCacheMap;
        },
        [
            // Once the map is > 1 sec old, consider refreshing
            'ageNew' => 1,
            // Increase the likelihood of a refresh to certainty once 1 minute old;
            // the most common keys are more likely to trigger this
            'hotTTR' => 60,
            // avoid querying cache servers multiple times in a request
            'pcTTL' => WANObjectCache::TTL_PROC_LONG
        ]
    );

    $value = isset( $cacheMap[$textHash] )
        ? $cacheMap[$textHash]
        : (int)$this->matchesRegularExpressionWithSparql( $text, $regex );

Interesting idea! It feels a bit weird to implement logic like this on top of the cache (I thought that’s the cache’s job?), but you’re the expert :) it sounds like it makes a lot of sense, at least, since the set of regexes is mostly static and the set of values is highly dynamic, with some very commonly used values.

I think I’ll remove the “don’t bother” microtime check, though, since it seems that even for an extremely simple query like SELECT (1 AS ?x) {}, the query service rarely responds in less than 0.04 seconds, and never in less than 0.02 seconds (tested from a Cloud VPS system within the Eqiad cluster).

@aaron do you mind if I make that a separate commit and credit you as the git commit author?

Interesting idea! It feels a bit weird to implement logic like this on top of the cache (I thought that’s the cache’s job?), but you’re the expert :) it sounds like it makes a lot of sense, at least, since the set of regexes is mostly static and the set of values is highly dynamic, with some very commonly used values.

I think I’ll remove the “don’t bother” microtime check, though, since it seems that even for an extremely simple query like SELECT (1 AS ?x) {}, the query service rarely responds in less than 0.04 seconds, and never in less than 0.02 seconds (tested from a Cloud VPS system within the Eqiad cluster).

That seems to make sense.

@aaron do you mind if I make that a separate commit and credit you as the git commit author?

Sure.

One thing that could also be tweaked in that code is to track the last timestamp an entry was touched and prune ones older than a certain number of days. For busy regexes, that stops long-tail cruft from accumulating and reduces memcached I/O in bytes. Another thing to tune would be the max size.
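A rough sketch of what that could look like (one reading of the suggestion; the entry layout, helper name, and 7-day limit are illustrative assumptions):

    // Assumes each cache-map entry is stored as
    // [ 'value' => 0|1, 'touched' => unix timestamp ] instead of a bare integer.
    function pruneStaleEntries( array $cacheMap, $maxAgeDays, $now ) {
        $cutoff = $now - $maxAgeDays * 86400;
        return array_filter(
            $cacheMap,
            function ( $entry ) use ( $cutoff ) {
                return isset( $entry['touched'] ) && $entry['touched'] >= $cutoff;
            }
        );
    }

    // e.g. inside the refresh callback, before injecting the new result:
    $curCacheMap = pruneStaleEntries( $curCacheMap, 7, time() );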

One thing that could also be tweaked in that code is to track the last timestamp an entry was touched and prune ones older than a certain number of days. For busy regexes, that stops long-tail cruft from accumulating and reduces memcached I/O in bytes.

If I understand correctly, that would help if we had some regexes that are frequently used (so the entire cache map is never evicted), but only with less than 100 values, so the bottom of the cache map would be a handful of stale, rarely-used values. I don’t think that’s a likely scenario right now (though it might be something to think about if we significantly increase the max size).

Another thing to tune would be the max size.

Yeah, I’ll make it configurable (and add tracking to see how often the $textHash was found in the cache map).
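For the tracking part, something along these lines could work (a sketch continuing the snippet above; the metric names are made up and the actual change may look different):

    // Count how often the per-regex cache map already contained the result.
    $statsd = \MediaWiki\MediaWikiServices::getInstance()->getStatsdDataFactory();
    if ( isset( $cacheMap[$textHash] ) ) {
        $statsd->increment( 'wikibase.quality.constraints.regex.cachehit' );
        $value = $cacheMap[$textHash];
    } else {
        $statsd->increment( 'wikibase.quality.constraints.regex.cachemiss' );
        $value = (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
    }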

@Krinkle wrote a while back:

On the other hand, considering it's for internal execution of a single regex, ~300ms is a lot. I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox?

I think these options are worth investigating. An HTTP call to the query service is a lot of overhead, even if cached.

Change 379222 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Aaron Schulz):
[mediawiki/extensions/WikibaseQualityConstraints@master] Use per-regex cache map to cache regex check results

https://gerrit.wikimedia.org/r/379222

Change 373918 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Cache regex check results

https://gerrit.wikimedia.org/r/373918

Change 379222 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Use per-regex cache map to cache regex check results

https://gerrit.wikimedia.org/r/379222

WMDE-leszek claimed this task.

Reopening. This task is supposed to be for caching results in general, which isn’t done yet at all, though we had a lot of discussion on caching regex checks specifically here, which in hindsight should’ve been in a separate task. Also, IMO the regex caching isn’t done yet, since the Grafana stats are pretty unsatisfactory.

(Perhaps we should repurpose this task to be just about regex checking, open a new one for general caching, and reshuffle the parent tasks so that this one is a child of the new task?)

Reopening. This task is supposed to be for caching results in general, which isn’t done yet at all, though we had a lot of discussion on caching regex checks specifically here, which in hindsight should’ve been in a separate task. Also, IMO the regex caching isn’t done yet, since the Grafana stats are pretty unsatisfactory.

(Perhaps we should repurpose this task to be just about regex checking, open a new one for general caching, and reshuffle the parent tasks so that this one is a child of the new task?)

The cache hit rate will be low if there is low traffic (as of now, with a few people using a JS tool), since not much will get populated and items will expire. Something like increasing the cache set() rate as the last-modified time of the cache entry increases would work. It would also result in many more entries being stored that might not get a lot of use, though.

I did a bunch of requests against https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42&constraintid=P1476%24F24FF782-E994-4946-BEEC-104CC592534F, which checks a format constraint for “title”. It’s always the same regex and only a handful of different values (17). But while I could see a sharp rise in requests in Grafana corresponding to the times when I sent those requests (permalink), most of them are still cache misses. I’m not sure how to interpret that – it seems values aren’t entering the cache map very often?

I did a bunch of requests against https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42&constraintid=P1476%24F24FF782-E994-4946-BEEC-104CC592534F, which checks a format constraint for “title”. It’s always the same regex and only a handful of different values (17). But while I could see a sharp rise in requests in Grafana corresponding to the times when I sent those requests (permalink), most of them are still cache misses. I’m not sure how to interpret that – it seems values aren’t entering the cache map very often?

What was the request rate and time window?

The time window is the one in the Grafana permalink. The two spikes are when I just fetched that URL with curl in a for i in {1..100} shell loop (synchronously), so it looks like those spikes are 100 requests over 2-3 minutes.

Probably hotTTR is way too high. It's really "expected time till refresh given 1 hit/sec". With 50/min, you'd get maybe 2 updates (new values) per regex. I'll put up a patch for that.

Change 385400 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/WikibaseQualityConstraints@master] Lower hotTTR in matchesRegularExpression() to raise hit rate

https://gerrit.wikimedia.org/r/385400

Change 385400 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Lower hotTTR in matchesRegularExpression() to raise hit rate

https://gerrit.wikimedia.org/r/385400

This change was deployed for about four hours yesterday, and during that time it seems to have improved the caching behavior a lot: based on Grafana data (permalink), the overall hit rate went from 8.5% to 60%.

awk -F';' '
# for each input file, sum the hit and miss counts (third field) and print the overall hit rate
$1=="hit" { hit+=$3 }
$1=="miss" { miss+=$3 }
ENDFILE { print 100*hit/(hit+miss) "%"; hit=0; miss=0; }
' without-change.csv with-change.csv

However, it may also have caused T179156: 503 spikes and resulting API slowness starting 18:45 October 26 :/

Lucas_Werkmeister_WMDE renamed this task from “Cache constraint check results” to “Cache format constraint check results”. Nov 6 2017, 5:30 PM
Lucas_Werkmeister_WMDE updated the task description. (Show Details)

(Perhaps we should repurpose this task to be just about regex checking, open a new one for general caching, and reshuffle the parent tasks so that this one is a child of the new task?)

Done. And with that, I think we can close this subtask. Regex result caching has now been deployed for some more time, and it clearly works: over the last 7 days, the cache hit rate (see script in previous comment) was 64%, which is far better than the previous 8.5%.

Thanks everyone for your help :)