
Reduce / remove the aggressive cache busting behaviour of wdqs-updater
Closed, Resolved · Public

Description

wdqs-updater adds a nocache=<timestamp> parameter to requests to wikidata. This is done to ensure that when processing recent changes, the data fetched from wikidata is not older than the update. This is quite aggressive and we should have a better way to ensure freshness.

This issue is raised as part of T217893, where an external wdqs-updater was generating a significant amount of traffic.

A few ideas:

  • disable cache busting by default, enable it internally
  • use the event date instead of the current date as timestamp (would enable caching the fetch for the same event from multiple clients)
  • don't do cache busting on events older than X
  • back off when data received is older than the event

There are probably better ways to do this, but I'm not really sure what is available on the wikidata API to ensure freshness.

Details

Related Gerrit Patches:
operations/puppet : productionEnable revision fetches in production
mediawiki/extensions/Wikibase : masterAllow revision dump for redirects
operations/puppet : productionEnable revisions support on internal clusters
wikidata/query/rdf : masterWork around status 400 on redirect revision fetch
operations/puppet : productionwdqs: expose revision-fetch mechanism
operations/puppet : productionEnable using revision-fetch mechanism for test & internal clusters
wikidata/query/rdf : masterImplement more cache-friendly Wikibase fetch strategy
mediawiki/extensions/Wikibase : masterAdd caching of Special:EntityData results

Event Timeline

Gehel created this task. Mar 8 2019, 1:54 PM
Restricted Application added a project: Wikidata. Mar 8 2019, 1:54 PM
Restricted Application added a subscriber: Aklapper.
BBlack added a subscriber: BBlack. Mar 8 2019, 2:13 PM

Looking at an internal version of the flavor=dump outputs of an entity, related observations:

Test request from the inside: curl -v 'https://www.wikidata.org/wiki/Special:EntityData/Q15223487.ttl?flavor=dump' --resolve www.wikidata.org:443:10.2.2.1

  • There is LM data, for this QID it currently says: last-modified: Fri, 08 Mar 2019 06:24:59 GMT
  • This could be used with standard HTTP conditional requests for If-Modified-Since. This would still cause a ping through to the applayer, but would not transfer the body if no change.
  • Or alternatively, use the same data that's informing the LM/IMS conditional stuff to set metadata in the dump output as well, so that your queries can use this as a datestamp that's shared among more clients (this is basically the use event date idea from the summary), so that it doesn't even need an LM/IMS roundtrip and can be a true cache hit.
  • The CC header is: cache-control: public, s-maxage=3600, max-age=3600
  • 1H seems short in general. We prefer 1d+ for the actual CC times advertised by major cacheable production endpoints, so that everything doesn't go stale too quickly during minor maintenance work on a cache or a site. Is there a reason? (Often it's set low because other issues around purging and this kind of update traffic haven't been well-engineered yet.)
  • However, assuming the 1H is staying for now, can't updaters just be ok with up to 1H of stale data and not cache bust at all? There's no such thing as async+realtime; there's always a staleness, it's just a question of how much is tolerable for the use-case.
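The conditional-request idea above can be sketched as a small decision function: given the Last-Modified header of a cached response and the timestamp of a change event, decide whether the updater needs to revalidate. This is an illustrative Python sketch, not code from the actual updater; the naive-UTC ISO 8601 event-timestamp format is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def event_newer_than_cache(last_modified: str, event_iso: str) -> bool:
    """Return True when a change event postdates the cached copy, i.e. the
    updater should revalidate with If-Modified-Since (or bypass the cache)
    rather than trust the cached body.

    last_modified: HTTP Last-Modified value,
                   e.g. "Fri, 08 Mar 2019 06:24:59 GMT"
    event_iso:     event timestamp as naive-UTC ISO 8601
                   (an assumed format, for this sketch only)
    """
    cached_at = parsedate_to_datetime(last_modified)
    event_at = datetime.fromisoformat(event_iso).replace(tzinfo=timezone.utc)
    return event_at > cached_at
```

If the event is older than the cached copy, the cached body is already fresh enough for that event and no roundtrip is needed at all.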

We've been around this topic a number of times, so I'll write a summary where we're at so far. I'm sorry it's going to be long, there's a bunch of issues at play. Also, if after reading this you think it's utter nonsense and I'm missing an obvious solution to this please feel free to explain it.

Why are we using non-caching URLs?

Because we want to have the latest data for the item. The item can be edited many times in short bursts (bots are especially known for this, but people do it too all the time). This is peculiar to Wikidata - Wikipedia articles are usually edited in big chunks, but on Wikidata each property/value change is a separate edit usually, which means there can be dozens of edits in a relatively short period.

If we use a static URL and we get a change for Q1234, we'd get the data for one of the edits stuck in the cache as "data for Q1234", and we'd have no way of getting the most recent data until the cache expires. This is bad (more on that later).

If we use a URL keyed by revision number, that means if we have 20 edits in a row, we'd have to download the RDF data for the page 20 times instead of just downloading it once (or maybe twice). This is somewhat mitigated by the batch aggregation we do, but our batches are not that big, so if there is a big edit burst, this completely kills performance (and edit bursts are exactly the place where we need every last bit of performance).

can't updaters just be ok with up to 1H of stale data and not cache bust at all?

So, we can use one of two ways in general:
A. Use revision-based URLs (described above) - for these we can cache them forever, since they don't change
B. Use general Qid-based URL without revision marker. Long cache for this would be very bad, for the following reasons:

  1. People expect to see the data they edit on Wikidata. If somebody edited a value and had to wait an hour for it to show up on WDQS, people would be quite upset. We can have somewhat stale data even now, but an hour-long delay is rare. And when it happens, people do complain.
  2. The Updater is event-driven, so if it gets an update for Q1234 revision X, it should be able to load data for Q1234 at least as new as revision X. If, due to the cache, it loads any older data, this data is stuck in the database forever unless there's a new update - since nothing will cause it to re-check Q1234 again.
  3. Data in Wikidata is highly interconnected. Unlike Wikipedia articles, which link to each other but are largely consumed independently, most Wikidata queries involve multiple items that interlink. Caching means that each of these items would be seen by WDQS as being in the state it was at some random moment in the past hour, with these moments differing between items (they can also differ between servers, due to cache expirations that happen in between server updates). That means you basically can't reliably run any query involving data edited in the past hour, as the result can be completely nonsensical - some items would be seconds-fresh and some items they refer to may be an hour old, producing completely incoherent results. And since it's not easy to see from a query which of the results may be freshly edited, this would reduce the reliability of service data a lot. It might be fine on a relatively static database, but Wikidata is not one.

I am not sure we can get around this even if we delay updates - even if we process only hour-old updates (and give up completely on the freshness we have now), we can't know where the hourly caching window for each item started - that would depend on when the edits happened. One item may be an hour old and another two hours old. Stale data would be bad enough; randomly, inconsistently stale data would be a disaster.

So I consider static URL with long caching a complete non-starter unless somebody explains to me how to get around the problems described above.

The only feasible way I can see is to pre-process the update stream to aggregate multiple edits to the same item over a long period of time and then do revision-based loads. Revision-based caching is safe with regard to consistency, and aggregation would mostly solve the performance issue. However, this means introducing an artificial delay into the process (otherwise the aggregation is useless) - which should be long enough to capture any edit burst on a typical item. And, of course, we'd need development effort to actually implement the aggregator service in a way that can serve all WDQS scenarios. We've talked about it a bit, but we don't currently have a work plan for this yet.
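The core of the aggregation idea can be sketched in a few lines: fold a burst of change events down to one fetch per item at its highest revision, then build immutable, revision-keyed URLs that are safe to cache forever. This is an illustrative Python sketch; the `revision` query parameter and URL shape are assumptions, not the updater's actual fetch code.

```python
from collections import OrderedDict

def collapse_burst(events):
    """Fold (item, revision) change events down to one entry per item
    carrying its highest revision, preserving first-seen order.  A real
    aggregator would run this over a time window on the update stream;
    the resulting revision-keyed URLs never change and are therefore
    cacheable indefinitely.
    """
    latest = OrderedDict()
    for item, rev in events:
        latest[item] = max(rev, latest.get(item, rev))
    return [f"Special:EntityData/{item}.ttl?flavor=dump&revision={rev}"
            for item, rev in latest.items()]
```

A burst of 20 edits to the same item collapses to a single revision-keyed fetch, which is the performance win the comment describes, at the cost of the aggregation delay.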

Smalyshev added a comment (edited). Mar 8 2019, 9:11 PM

disable cache busting by default, enable it internally

This would immediately break all external updaters. They'd just pick up the first update in a bunch and ignore the rest, because of the caching.

use the event date instead of the current date as timestamp (would enable caching the fetch for the same event from multiple clients)

Timestamps are very bad identifiers, since they don't have enough resolution - many edits can happen in a second. Also, AFAIK there's no easy way to fetch revision by timestamp, only by revision ID. Also, see above about why we don't want to fetch by revision ID - this applies for timestamps too, even if they worked.

don't do cache busting on events older than X

This can work only if we knew that there's an edit for the same item newer than X. Otherwise, of two edits older than X, you'd get the data for the first, and the second would be ignored since it would fetch the data from the cache. Of course, if you knew there was a newer edit, you could just skip all the events before that edit altogether :)

back off when data received is older than the event

We already do, see T210901 and discussion around it, it's not fully implemented yet but that's the workaround we're using against inconsistent replication.

don't do cache busting on events older than X

This however gave me an idea. If we kept a map of all latest revision IDs for all items we've recently updated, we could eliminate a lot of stale updates - especially when we're catching up after the lag. The first mention of the item would fetch the latest rev, and then all the following events would basically be ignored.

Right now we do something like that within the batch, and again match the revision IDs against the database after the fetches - but this way we can do it cross-batch and eliminate the unnecessary fetches. Basically that'd solve the problem of lots of fetches (while the cache is active) since each item will be fetched only once per backlog. I think with proper data structure (like SparseArray maybe?) we could keep a lot of history there relatively cheaply (we just need one 64-bit int per item). Also probably won't work for changes that lack revision ID - like deletes - but we could either ignore those (they are relatively rare) or also use timestamps (dangerous).
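The cross-batch revision map described above can be sketched as a small class: one integer per item, consulted per incoming event, so events at or below a revision already handled are dropped without a fetch. Illustrative Python only; the class and method names are not from the actual updater.

```python
class RecentRevisions:
    """Cross-batch map of the newest revision handled per item.

    One integer per item id, so even a large backlog stays cheap to
    track.  should_fetch() is called for each incoming change event;
    it returns False for events we have already covered, eliminating
    the corresponding fetch entirely.
    """
    def __init__(self):
        self._latest = {}  # item id -> newest revision handled

    def should_fetch(self, item, revision):
        if self._latest.get(item, -1) >= revision:
            return False  # stale event: we already have this data or newer
        self._latest[item] = revision
        return True
```

When catching up after lag, the first event for an item records (and fetches) the latest revision, and every later event for an older revision of that item is ignored, which is exactly the "fetched only once per backlog" behaviour described above.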

Gehel added a comment. Mar 12 2019, 9:08 AM

Somewhat related to this issue: T207837. This ticket is about sharing the common workload of updating, including the fetches from wikidata, which would also reduce the load on wikidata.

Gehel added a comment. Mar 12 2019, 9:14 AM

Given the discussion above, I'm not sure I understand why an If-Modified-Since: would not work. What am I missing?

Gehel added a comment. Mar 12 2019, 9:18 AM

In the end, this looks like a more generic issue of processing events with some level of transactionality. It looks like some of this might be addressed in T185233 (more specifically in T105766?). I don't fully understand the exact goal of those tickets, but it at least makes sense to raise the issue so that the WDQS use case is addressed, if it can be addressed.

ema triaged this task as Normal priority. Mar 12 2019, 2:06 PM
ema moved this task from Triage to Caching on the Traffic board.

I'm not sure I understand why a If-Modified-Since: would not work.

How would you see it working? In Varnish, it would be useless, since Varnish has no way of knowing whether a Wikidata item changed since being cached. If we go to the backend, we are already incurring all the costs of PHP setup, database connection and loading the data just to know whether the data has been edited or not. But there are more complications than that:

  1. If-Modified-Since requires a timestamp. Timestamps (in seconds) do not have enough granularity to track changes in Wikidata - there can be many edits within one second.
  2. The timestamp we have in the change event is not necessarily the timestamp on the database edit (we could probably ensure it's the same, but due to the above it's useless anyway and we have to use revision IDs).
  3. In most cases, we can already know whether we have a certain revision without calling Wikidata - revisions are monotonic, which means if we have revision X in the database and a change comes with revision Y where Y<X, there's no point in calling Wikidata for this revision - we already have data newer than this change event.

Thus, I don't see how and in which scenario using If-Modified-Since would be beneficial.

it at least makes sense to raise the issue so that the WDQS use case is addressed if it can be addressed

For that, we'd better define the "WDQS case". The best I have right now is the event aggregation idea described above. In theory, it could also be combined with data loading (so that Wikidata data is loaded only once per stream), though I am not sure how well it would work, given that data sizes for some items are pretty large and combining huge data arrays with small events may be counter-productive. Some intermediate storage may solve the issue.

I think it would be better, from my perspective, to really understand the use-cases better (which I don't). Why do these remote clients need "realtime" (no staleness) fetches of Q items? What I hear sounds like all clients expect everything to be perfectly synchronous, but I don't understand why they need to be perfectly synchronous. In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries, which looks like a bulk data load/update process, not an "I just edited this thing and need to see my own edits reflected" sort of case.

Why do these remote clients need "realtime" (no staleness) fetches of Q items?

Because that's what Query Service is - realtime (well, near-realtime, given update times) queryable representation of Wikidata content in RDF form.

What I hear is it sounds like all clients expect everything to be perfectly synchronous,

Not sure what you mean by "synchronous" here, could you explain?

In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries

If you start with an old dump/starting point, yes, this is essentially a bulk data load. It is not meant for external clients, so there are no actual rate controls built in, except for the server itself rejecting the queries. Of course, rate limiting would mean the endpoint may take a very long time to catch up...
BTW, in this situation I don't really see how caching would help, as the Updater is not supposed to ever ask for the same content (or, in a situation with a large backlog, for the same item) twice. If we need a more detailed look into it, maybe set up a meeting so we could do an interactive dive-in?

In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries

It's not just Orange, it would seem.
I took a quick look at the webrequest data for the WDQS updater UA, and there are other locations also running the updater, probably to keep their own copy of the query service up to date.
Looking at March 1st 2019, 58% of the uncacheable requests to Special:EntityData for wikidata came from internal wdqs systems; the other 42% came from what seems to be another 9 or so external copies of the wdqs that are kept up to date with the live data using the updater.

On 1st March there were 21,461,124 cache misses for the WDQS updater UA.
To put that in perspective, the total number of cache misses for wikidata on that day was 23,857,783 (passes were 54 million).

Comparing this with the total number of edits on wikidata in that day, ~1.1 million, I see quite some uncached requests that could be cached.
I understand that there are some issues with the wdqs updating for every single revision due to the performance of writing to SPARQL, however.
Taking a look at our internal hosts on 1st March, they seemed to make between 926,848 and 826,878 requests to Special:EntityData, so they each avoided 200k-300k requests to get data (and thus probably also SPARQL writes).

Thinking purely from a varnish hit-rate perspective, it would make sense to remove the "random cache busting" (I guess not actually random, as it is timestamp-based: nocache=<timestamp>) from the request and switch to asking for the specific revision id that is required.
This would likely go from 21 million misses per day to just 1 million (the initial 1 million requests to populate the cache for each revision being requested)?
After talking with Stas, this apparently makes updating within the updater harder, as it might result in more writes to SPARQL (I'll let Stas talk more on that topic).

Adding nocache=X doesn't actually mean the request will not be cached; it is still cached, just unlikely to be hit by other users (probably wasting quite some varnish space?).
It looks like we probably get ~10k cache hits even with the cache buster from the WDQS UA, maybe when the servers happen to be requesting the same entities during the same second.
If we don't want to explicitly ask for a revision from the page, can we not use the latest revision id we know exists for the entity, or some hash of it, to make for a nicer cache buster that could actually be shared among updaters both internal and external? The updating pattern within the wdqs itself could stay the same then. I guess this depends on whether the wdqs updater knows what the latest revid is for the entity it is getting updates for?
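The "nicer cache buster" idea above amounts to deriving the busting value deterministically from data every updater shares. A minimal sketch in Python, assuming the updater knows the latest revid for the entity; keeping the `nocache` parameter name is purely for illustration, not a statement about the real updater's URL format.

```python
def entity_data_url(item, known_revid):
    """Build a Special:EntityData fetch URL whose cache-busting value is
    the newest revision id the updater knows for this entity, instead of
    the current timestamp.  Every updater (internal or external) that has
    consumed the same event stream derives the same URL, so the first
    fetch warms Varnish for all of them.
    """
    return ("https://www.wikidata.org/wiki/Special:EntityData/"
            f"{item}.ttl?flavor=dump&nocache={known_revid}")
```

Unlike a timestamp, the revid only changes when the entity actually changes, so repeated requests for the same state of an entity all hit the same cache object.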

Another thing to consider here: in theory, even when using the cache buster method, the data the wdqs updater currently gets when passing nocache=ts may not be up to date due to maxlag; not sure if that has been considered in the updater process at all?
It's not often that the maxlag is high, but in the last months it has occasionally gone up to 5s or 20s. (Not sure if the wdqs updater normally requests data for an entity that quickly after an edit has been made, but if it does, it could be getting out-of-date data even with the cache busting.) But perhaps the Last-Modified header is checked in the wdqs updater? If not, maybe it should be? (Grepping through the code, I couldn't find it.)

On the Wikibase side of things, this is a relatively cheap request to make, as the revision lookup is done from the big shared cache of wikidata entity revisions, and with flavour=dump Wikibase itself in most cases will not make any expensive SQL queries (but anything MediaWiki does on startup will still happen).

The only other thing I was going to add (forgot before I hit submit on the last post):

Within the cluster varnish cached results for entities return much faster than the php returned results (of course)

entity                      | varnish result | php result | page selection
Q1.ttl?flavour=dump         | ~0.06-0.07s    | ~0.6-0.7s  | randomish
Q64.ttl?flavour=dump        | ~0.15-0.16s    | ~2.3-2.5s  | randomish
Q100.ttl?flavour=dump       | ~0.13-0.14s    | ~2s        | randomish
Q55886027.ttl?flavour=dump  | ~0.14s         | ~7-17s?    | LongPages
Q2911127.ttl?flavour=dump   | ~0.02s         | 0.06s      | ShortPages

Data was gathered from a prod mw host with requests like the following

cat curl-format.txt
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                      ----------\n
         time_total:  %{time_total}\n

curl -w "@curl-format.txt" -o /dev/null -s "https://www.wikidata.org/wiki/Special:EntityData/Q2911127.ttl?flavour=dump&addshore=17"

I guess the wdqs internal machines would have comparable response times?

It's hard to really figure anything concrete out from this but the wdqs updater / updaters would potentially spend a lot less time waiting for responses (maybe they already do them async?) if they hit varnish more?

Doing some terrible maths, looking at the smallest possible time saving for a short page (so 0.04s saved by hitting the cache) and assuming 1 million edits in a day (based on the comment above, even though right now the wdqs updater does a small amount of batching so makes fewer requests): 1,000,000 × 0.04s = 40,000s ≈ 11 hours per host?
This doesn't really help if the slowest part of the process is actually writing the data to blazegraph, but 11 hours in a 24 hour period is still pretty significant. I hope the Java updater does some amount of async work (writing to blazegraph while getting the next data ready?)

After talking with Stas this apparently makes updating within the updater harder etc as it might result in more writes to sparql?

Yes because it would have to do SPARQL Update for each individual revision.

I guess the wdqs internal machines would have comparable response times?

You can see response times for RDF loading in the dashboard: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-24h&to=now&panelId=26&fullscreen

but 11 hours in a 24 hour period is still pretty significant

I'm not sure I understand how this figure was obtained, but there's absolutely no way the Updater spends half its time waiting for RDF loading. In reality, it spends most of its time in SPARQL Update.

I hope the Java updater does some amount of async work

All RDF is loaded in parallel of course (10 threads if I remember correctly). It should be relatively easy to see timings by yourself - just run the Updater with verbose logging (DEBUG level I think - -v option should do that).

writing to blazegraph while getting the next data ready?

That could be possible but doesn't happen now. May be a good idea to try. However, since SPARQL Update dominates the timings pretty heavily, it's unlikely we'd save much. And since we need to validate IDs against the database (to ensure we don't already have the revision we're about to fetch), we cannot fetch RDF before the previous update has finished, thus reducing the parallelizable part to essentially only Kafka data loading, which doesn't seem to be worth it.

Another thing to consider here is in theory even when using the cache buster method the data the wdqs updater currently gets when passing nocache=ts may not be up to date due to maxlag, not sure if that has been considered in the updater process at all?

Yes, see discussion at T210901 and T212550. TLDR: we know it happens, we have stopgap measure to counter it, but we haven't implemented the real solution yet.

I guess the wdqs internal machines would have comparable response times?

You can see response times for RDF loading in the dashboard: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-24h&to=now&panelId=26&fullscreen

So, p95 is around 80ms, which lines up pretty well with the data in T217897#5020225 where a short page took 60ms.
I guess most entities are toward the smaller end of the scale as the mean seems to be closer to ~55-60ms.
Still, a cache hit for something that takes 80ms in php would likely only take ~25ms if hitting varnish.

but 11 hours in a 24 hour period is still pretty significant

I'm not sure I understand how this figure was obtained but there's absolutely no way Updater spends half time in waiting for RDF loading. In reality, it spends most of its time in SPARQL Update.

I'm not sure if the figure is totally accurate, it is based on multiple estimations.
Let me try and refine it slightly.

Again, this ignores batching, but the actual edit count on the day was ~1.1 million, which resulted in between 926,848 and 826,878 requests to load entity data, depending on which wdqs host you look at, so 84-75% of edits end up triggering an entity data load with the current batching methods.

But the fact stands that varnish will always respond faster than php, and looking at even the smallest entity, a varnish hit shaves around 50% off the request time.
If we say a single wdqs host is making 800k requests to Special:EntityData right now, with an average load time of 55ms, that's 44,000,000ms, or about 12 hours spent loading data.
If we pretend we are loading every single revision (so 1.1 million) and actually hit the varnish cache (well, sometimes not, if we are the first server to ask), then we have ((1,100,000 / 12 × 11) × 25ms) + ((1,100,000 / 12) × 55ms) = 30,250,000ms, or ~8.4 hours.
So probably a saving of closer to 4 hours of loading time each day per instance. But if the updater were then actually to write to blazegraph for each of the retrieved revisions, that would of course be 300k more updates; IMO, though, the wdqs updater can still request revisions like this and choose not to actually write every single revision.
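The back-of-envelope figures above can be reproduced directly; all inputs are the rough numbers quoted in this thread (not measurements), so this only checks the arithmetic, not the assumptions.

```python
# Rough figures quoted in the thread, not measurements.
edits = 1_100_000            # Wikidata edits that day
current_requests = 800_000   # entity-data fetches per wdqs host today
php_ms, varnish_ms = 55, 25  # approximate mean response times

# Today: every fetch goes through to PHP.
current_hours = current_requests * php_ms / 1000 / 3600

# With revision-keyed URLs shared by ~12 hosts: the first host to ask for
# a given revision misses (PHP), the other 11 hit Varnish.
cached_hours = (edits * 11 / 12 * varnish_ms
                + edits / 12 * php_ms) / 1000 / 3600

print(round(current_hours, 1), round(cached_hours, 1))
```

This lands at roughly 12 hours today versus roughly 8.4 hours with shared revision-keyed caching, i.e. the ~4-hour daily saving per instance mentioned above.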

This is all generally meant to just highlight that hitting varnish is obviously going to be faster, even if the updater itself thinks that entity data retrieval is fast enough.

writing to blazegraph while getting the next data ready?

That could be possible but doesn't happen now. May be a good idea to try. However, since SPARQL Update dominates the timings pretty heavily, it's unlikely we'd save much. And since we need to validate IDs against the database (to ensure we don't already have the revision we're about to fetch), we cannot fetch RDF before the previous update has finished, thus reducing the parallelizable part to essentially only Kafka data loading, which doesn't seem to be worth it.

I'm still a bit confused about this logic inside the updater, especially with this id validation checking if we have the revision already etc?
The fastest way for this to work, in the distributed fashion it is currently laid out in, is to just retrieve every revision of entity data using a varnish-cacheable query string, hold the latest revision of an entity in some internal queue in the updater for a few seconds while waiting for more updates, and then commit that to blazegraph for storage after a few seconds.
This means reducing the PHP calls dramatically, increasing varnish hits, decreasing overall time spent waiting for Special:EntityData responses, and still allowing for batching.

A few other comments that we might want to think about.

PHP is being hit very roughly with 12.5 million requests to turn some PHP object into RDF output for special entity data, we might want to just consider caching that in its own memcached key inside wikibase so we only have to do that conversion once per revision, reducing this logic from running 12.5 million times to just around 1.1 million times.
This hasn't been considered before because Special:EntityData is cacheable, and these should all be varnish cache hits anyway, but if the updater behaviour does not change then maybe we should add this?

Also, regarding 3rd parties using the updater: perhaps the revid-based approach needs to be developed anyway, to reduce the load that is likely to continue to increase. These should be hitting the cache, but they should also not be getting out-of-date data; revid is the solution to that.

https://grafana.wikimedia.org/d/000000188/wikidata-special-entitydata shows the issue pretty well with the number of requests for uncached ttl data to Special:EntityData.
In the last year the number of requests seems to have doubled, and the request rate doesn't look like it is slowing down; are we ready to double the requests using this uncacheable method again in the next 12 months?

Addshore moved this task from incoming to monitoring on the Wikidata board. Mar 25 2019, 4:06 PM

I'm still a bit confused about this logic inside the updater, especially with this id validation checking if we have the revision already etc?

Not sure what you mean by "already". You can have a revision ID in the change, and a revision ID in Wikidata, but you still have to check against the revision ID in Blazegraph, so that you do not replace newer data with older data.

hold the latest revision of an entity in some internal queue in the updater for a few seconds while waiting for more updates, and then just commit that to blazegraph for storage after a few seconds

Not sure how holding it in the queue for a few seconds would help anything. You'd just time-shift the whole process several seconds into the past, but otherwise nothing would change. If you mean batching the updates, we already do that. But a batch of updates covering several seconds would be huge (some bots do hundreds of updates per second), and putting them into SPARQL queries would make them very slow. If we split them, we slow the process down and take the risk that the whole update was useless since new data has already arrived. I am not sure how waiting for a few seconds helps anything beyond what the current process is already doing (while introducing additional complexity, as we could no longer assume we're working with the latest data but would always have to track which delayed update this data relates to). Maybe I misunderstand something in your proposal.

This means reducing the PHP calls dramatically, increasing varnish hits,

It may raise varnish hits (since everything would be a varnish hit), but as for reducing PHP calls, I am not sure about that, because instead of fetching only the newest edit, if the entry is edited 100 times, you now need to fetch 100 edits. That's 100x the PHP calls.

PHP is being hit very roughly with 12.5 million requests to turn some PHP object into RDF output for special entity data, we might want to just consider caching that in its own memcached key inside wikibase so we only have to do that conversion once per revision

May be worth considering, but we have tons of revisions - do we have enough memory for such a cache? Some entries are huge, and if one letter changes in 30M of RDF, we'd be storing two 30M revisions differing in one byte. Of course, we could limit the size of the cacheable RDF - not sure how many resources are cached.

Looking at the distribution of Special:EntityData fetches, if we cache entities under 10K, we will capture about 90% of them. The most frequent sizes are 1 to 4K. So caching is probably worth trying.
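The caching scheme being proposed here can be sketched as a cache keyed by (entity, revision) with a size cutoff around the 10K figure above. This is an illustrative Python sketch, not the Wikibase patch: the class name and the `render` callable are hypothetical stand-ins for the actual RDF serialization and memcached layer.

```python
class RdfOutputCache:
    """Cache rendered RDF per (entity, revision), with a size cutoff.

    A revision's RDF never changes, so the key can never go stale and
    needs no purging.  Oversize entries are rendered but not stored,
    bounding memory use while still capturing ~90% of fetches at a
    ~10K cutoff (per the distribution quoted above).
    """
    MAX_CACHEABLE_BYTES = 10 * 1024

    def __init__(self, render):
        self._render = render  # (entity, revision) -> RDF string
        self._cache = {}

    def get(self, entity, revision):
        key = (entity, revision)
        if key in self._cache:
            return self._cache[key]
        rdf = self._render(entity, revision)
        if len(rdf.encode()) <= self.MAX_CACHEABLE_BYTES:
            self._cache[key] = rdf
        return rdf
```

With this shape, the expensive PHP-object-to-RDF conversion runs once per revision rather than once per fetching client.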

Change 499363 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add caching of Special:EntityData results

https://gerrit.wikimedia.org/r/499363

I'm still a bit confused about this logic inside the updater, especially with this id validation checking if we have the revision already etc?

Not sure what you mean "already". You can have revision ID in the change, and revision ID in Wikidata, but you still have to check against revision ID in Blazegraph, so that you do not replace newer data with older data.

I'm not quite sure how you would get to this situation in the first place though?
Is the stream of events suddenly going to start sending old events?
Or is this mainly a situation after data has been bulk loaded?

hold the latest revision of an entity in some internal queue in the updater for a few seconds while waiting for more updates, and then just commit that to blazegraph for storage after a few seconds

Not sure how holding it in the queue for a few seconds would help anything. You'd just time-shift the whole process several seconds into the past, but otherwise nothing would change. If you mean batching the updates, we already do that. But a batch of updates covering several seconds would be huge (some bots do hundreds of updates per second), and putting them into SPARQL queries would make them very slow. If we split them, we slow the process down and take the risk that the whole update was useless since new data has already arrived. I am not sure how waiting for a few seconds helps anything beyond what the current process is already doing (while introducing additional complexity, as we could no longer assume we're working with the latest data but would always have to track which delayed update this data relates to). Maybe I misunderstand something in your proposal.

Yes, the waiting a few seconds would be for batching changes to the same entity, but it would be waiting on the stream of events for entity changes: wait until the entity has not been touched for 10 seconds (or something), then request the last revid that the updater received from Special:EntityData using that revid, create the SPARQL, and do the update.
I'm thinking about batched updates per entity, not batched updates of all changes in a set period of time.
Again, I'm mainly proposing this to try and get revid to be used. I still don't understand, if the above is essentially what the updater is already doing, why revid can't be used. If I were going to write something to update the query service from the ground up, with no knowledge of what has already been attempted, the above is what it would do.
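A minimal sketch of that per-entity debounce, to make the proposal concrete. The 10-second quiet period and the data structures are illustrative assumptions, not the updater's actual implementation:

```python
QUIET_PERIOD = 10  # seconds with no further edits before an entity is flushed

class EntityDebouncer:
    """Collapse a burst of edits to one entity into a single fetch of the
    newest revid seen for it (sketch of the per-entity batching idea)."""

    def __init__(self):
        self.pending = {}  # entity id -> (newest revid seen, last event time)

    def on_event(self, entity, revid, now):
        prev_revid, _ = self.pending.get(entity, (0, 0.0))
        self.pending[entity] = (max(prev_revid, revid), now)

    def flush(self, now):
        """Return (entity, revid) pairs untouched for QUIET_PERIOD seconds;
        each would become one cacheable Special:EntityData fetch by revid."""
        ready = [(e, r) for e, (r, t) in self.pending.items()
                 if now - t >= QUIET_PERIOD]
        for e, _ in ready:
            del self.pending[e]
        return ready

d = EntityDebouncer()
d.on_event("Q42", 100, now=0.0)
d.on_event("Q42", 101, now=3.0)   # second edit within the quiet period: coalesced
d.on_event("Q64", 200, now=5.0)
assert d.flush(now=14.0) == [("Q42", 101)]  # Q42 quiet for 11s, Q64 only for 9s
assert d.flush(now=16.0) == [("Q64", 200)]
```

The trade-off Smalyshev raises still applies: every flushed entity is at least QUIET_PERIOD seconds stale by construction.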

This means reducing the php calls dramatically, increasing varnish hits,

It may raise varnish hits (since everything would be a varnish hit), but as for reducing PHP calls, I am not sure about that: instead of fetching only the newest edit, if the entry is edited 100 times, you now need to fetch 100 edits. That's 100x the PHP calls.

Well, the underlying PHP calls that happen as a result of hitting varnish would decrease dramatically even if every single revision was requested by revid in ttl format, due to the current distributed nature of the updater.
Even if edits on wikidata were slower, 1 edit on wikidata would result in 12 PHP runs using the cache busting (ignoring the batching).
Hitting a revid, 1 edit would result in 1, maybe 2, PHP hits, depending on how fast varnish was to cache the result.
Again, I'm ignoring the batching here, as it definitely does not give us the 12x decrease in requests to PHP that we would get by using a cacheable URL.
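As a back-of-the-envelope sketch of that fan-out (the 12-updater count and the ~1.1 million edits/day are taken from the comments here; real counts differ because of batching and skip-ahead):

```python
def php_renders(edits, updaters, cacheable_urls):
    """Rough count of PHP entity-data renders for a day of edits.

    With cache-busted URLs, every updater's request misses varnish, so each
    edit is rendered once per updater. With a revid-based cacheable URL, the
    first request renders the entity and later ones are served by varnish.
    """
    return edits if cacheable_urls else edits * updaters

EDITS_PER_DAY = 1_100_000  # approximate figure from T217897#5048178
UPDATERS = 12

assert php_renders(EDITS_PER_DAY, UPDATERS, cacheable_urls=False) == 13_200_000
assert php_renders(EDITS_PER_DAY, UPDATERS, cacheable_urls=True) == 1_100_000
```

This is the rough shape of the ~12x gap discussed below between edit count and PHP executions.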

This is briefly backed up by data in T217897#5048178

"but the actual edit count on the day was ~1.1 million, which resulted in between 926848 and 826878 requests to load entity data"
That is per host.
So 1.1 million edits, but around 10 million PHP code executions (at least) to update the query services, when in my eyes that should really be no more than the number of edits.

PHP is being hit with very roughly 12.5 million requests to turn some PHP object into RDF output for Special:EntityData. We might want to consider caching that in its own memcached key inside Wikibase, so we only have to do that conversion once per revision.

May be worth considering, but we have tons of revisions; do we have enough memory for such a cache? Some entries are huge, and if one letter changes in 30M of RDF, we'd be storing two 30M revisions differing by one byte. Of course, we could limit the size of the cacheable RDF; I'm not sure how many resources would be cached then.

So the shared cache for entity revisions inside Wikibase exists per entity, not per revision, but it is updated during save and can be assumed to hold the latest revision.
It is shared between wikidata.org and all client sites, and used for essentially all entity revision retrieval (we don't actually have numbers for the cache hit rate here, but I imagine it is pretty high...).
Even when retrieving an entity with a revision ID, the shared cache will be used, or at least checked to see if it contains the correct/latest revision, to skip the DB call.

Specifically, caching RDF in Special:EntityData only makes sense if we are going to continue hitting the page so much and skipping the varnish cache.
If we change the access pattern of the updaters to Special:EntityData, then the varnish cache already is that cache.

Looking at the distribution of Special:EntityData fetches, if we cache entities under 10K, we will capture about 90% of them. The most frequent sizes are 1 to 4K. So caching is probably worth trying.

I left some comments on the patch, but still think that the cache we are talking about there would be unnecessary if the wdqs just hit varnish.

the cache we are talking about there would be unnecessary if the wdqs just hit varnish.

It is problematic for WDQS to "just hit varnish", because varnish does not know whether a certain revision is the latest one available or not. Wikidata, on the other hand, does.

the cache we are talking about there would be unnecessary if the wdqs just hit varnish.

It is problematic for WDQS to "just hit varnish", because varnish does not know whether a certain revision is the latest one available or not. Wikidata, on the other hand, does.

Again my flipped way of looking at that is that WDQS does know what the latest version of the entity that it is trying to get updates for is, therefore it should probably just ask for it.

WDQS does know what the latest version of the entity that it is trying to get updates for is,

But "last version that WDQS knows of" can be very different from "last version that Wikidata has". That's the whole issue.

I had an idea recently though. Maybe we could work in two modes: if the stream is lagged sufficiently, we use the "latest available" mode to jump to the front; but if we're more or less current, the probability of our change being current is high, so we could use the "by revision ID" mode. I need to look at edit timings to see if it's workable, but maybe splitting the two cases (catching up from a large lag, and keeping current) would be more efficient, and it would allow us to use the cache for the most frequent case (which is "keeping current"). That would be easy to implement and test: just a couple of ifs in the proper places.
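A sketch of what those two modes could look like. The 30-second lag threshold is an illustrative assumption, and the URL parameters shown (flavor, revision, nocache) are the ones discussed in this task, not necessarily the updater's exact request format:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: above this lag, assume our revision ID may already
# be superseded and fall back to the cache-busting "latest" fetch.
LAG_THRESHOLD = timedelta(seconds=30)

def entity_data_url(entity_id, revision_id, event_time, now):
    """Pick a Special:EntityData fetch URL based on stream lag."""
    base = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.ttl?flavor=dump"
    if now - event_time > LAG_THRESHOLD:
        # Lagged: jump to the front by asking for the latest data,
        # busting intermediate caches.
        return base + f"&nocache={int(now.timestamp())}"
    # Current: fetch the exact revision; this URL is stable and cacheable.
    return base + f"&revision={revision_id}"

now = datetime(2019, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = entity_data_url("Q42", 901802292, now - timedelta(seconds=5), now)
stale = entity_data_url("Q42", 901802292, now - timedelta(minutes=5), now)
assert "revision=901802292" in fresh and "nocache" not in fresh
assert "nocache" in stale
```

The appeal of this split is that the common case ("keeping current") produces identical URLs across all updaters, so varnish can serve all but the first request.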

@Addshore btw, do I understand right that constraints can not be fetched per revision? In that case, do we still need cache busting there? Or do constraints manage their own caches? I am not sure what to do here.

Change 499951 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Implement more cache-friendly Wikibase fetch strategy

https://gerrit.wikimedia.org/r/499951

WDQS does know what the latest version of the entity that it is trying to get updates for is,

But "last version that WDQS knows of" can be very different from "last version that Wikidata has". That's the whole issue.
I had an idea recently though. Maybe we could work in two modes: if the stream is lagged sufficiently, we use the "latest available" mode to jump to the front; but if we're more or less current, the probability of our change being current is high, so we could use the "by revision ID" mode. I need to look at edit timings to see if it's workable, but maybe splitting the two cases (catching up from a large lag, and keeping current) would be more efficient, and it would allow us to use the cache for the most frequent case (which is "keeping current"). That would be easy to implement and test: just a couple of ifs in the proper places.

That sounds like a pretty good idea!
How often do the updaters get lagged behind the stream?

Another thing that we could also tweak is the cache-busting method. Right now a timestamp is used, all the way down to the second.
If the cache buster had slightly less granularity (such as only timestamps ending in even seconds, or in 0 and 5), the probability of hits within the same few seconds between different updaters in the cluster would be greatly increased.
But this would be a cherry on top, and if we already have the majority of requests using revid, this probably isn't too bad.
I guess whether this would work or not again depends on what the internals of the updater do, and whether this means things would definitely be out of date in places, or whether it would be able to handle that.
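The coarser cache buster could be as simple as rounding the timestamp down to a bucket. A minimal sketch, with a 5-second bucket as an illustrative choice:

```python
def coarse_nocache(timestamp, bucket_seconds=5):
    """Round a cache-busting Unix timestamp down to a coarser bucket, so
    concurrent updaters within the same few seconds produce the same URL
    (and therefore can share a varnish cache entry)."""
    return timestamp - (timestamp % bucket_seconds)

# Two updaters hitting the same entity 3 seconds apart produce the same
# buster, so the second request can be a cache hit...
assert coarse_nocache(1554120001) == coarse_nocache(1554120004)
# ...while requests in different buckets still bust the cache.
assert coarse_nocache(1554120001) != coarse_nocache(1554120006)
```

The cost is that "fresh" data may now be up to one bucket old, which is the trade-off the comment above is pointing at.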

Another option, which would be more involved, would be to have a single consumer of the stream do the hard work (generating SPARQL) and feed that back into another stream, so that Wikibase is only hit once for each change rather than once by each updater.

@Addshore btw, do I understand right that constraints can not be fetched per revision? In that case, do we still need cache busting there? Or do constraints manage their own caches? I am not sure what to do here.

Constraints can not be fetched per revision; you can only get the latest version.
The constraint check results for a single revision can change, so there is little point in tying them to a revid.
When the work in this area is finished, the results will be persistently stored, so retrieving them will be cheap: they will be calculated after each edit, persisted, and then an event added to a stream saying that new constraint check data for X now exists.

Change 500359 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable using revision-fetch mechanism for test & internal clusters

https://gerrit.wikimedia.org/r/500359

Change 499951 merged by jenkins-bot:
[wikidata/query/rdf@master] Implement more cache-friendly Wikibase fetch strategy

https://gerrit.wikimedia.org/r/499951

Change 500359 merged by Gehel:
[operations/puppet@production] Enable using revision-fetch mechanism for test & internal clusters

https://gerrit.wikimedia.org/r/500359

Change 501056 had a related patch set uploaded (by Gehel; owner: Smalyshev):
[operations/puppet@production] wdqs: expose revision-fetch mechanism

https://gerrit.wikimedia.org/r/501056

Change 501056 merged by Gehel:
[operations/puppet@production] wdqs: expose revision-fetch mechanism

https://gerrit.wikimedia.org/r/501056

Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board.Apr 3 2019, 11:41 PM

Looks like we have a problem with redirects: they can not be fetched by revision. E.g.:

https://www.wikidata.org/wiki/Special:EntityData/Q33044800.ttl?flavor=dump&revision=901802292

produces a 400. Since @Addshore is on vacation, @Lucas_Werkmeister_WMDE maybe you know if there's a way to fix this? It looks like only the latest revision of a redirect produces the error; previous ones work fine. Looking at the code, fetching a revision that is a redirect is for some reason forbidden. Not sure why.

I've also made a counter to check how many "forward skips" we get, i.e. loading a revision further ahead than the one we were asked for in the change. The averages are between 0.1 and 0.5, sometimes going to 1; that means we're saving up to one item fetch/update per second, or, since we're processing about 10 updates per second, a 1% to 10% speed improvement. 1% is low, but 10% is not, so we may not want to give up on skip-ahead just yet.

Change 501450 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Work around status 400 on redirect revision fetch

https://gerrit.wikimedia.org/r/501450

Change 501450 merged by jenkins-bot:
[wikidata/query/rdf@master] Work around status 400 on redirect revision fetch

https://gerrit.wikimedia.org/r/501450

Change 502655 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Allow revision dump for redirects

https://gerrit.wikimedia.org/r/502655

Smalyshev moved this task from Doing to In review on the User-Smalyshev board.Apr 10 2019, 10:16 PM

Change 502909 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable revisions support on internal clusters

https://gerrit.wikimedia.org/r/502909

Change 502909 merged by Gehel:
[operations/puppet@production] Enable revisions support on internal clusters

https://gerrit.wikimedia.org/r/502909

Change 504990 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable revision fetches in production

https://gerrit.wikimedia.org/r/504990

Results of caching can be seen here:
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-30d&to=now&panelId=24&fullscreen&var-cluster_name=wdqs-internal

Deploy date is Apr 11; fetch time drops from 135/195 ms (eqiad/codfw) to 90/150 ms when requests are cached.

Change 502655 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow revision dump for redirects

https://gerrit.wikimedia.org/r/502655

Change 504990 merged by Gehel:
[operations/puppet@production] Enable revision fetches in production

https://gerrit.wikimedia.org/r/504990

Smalyshev moved this task from In review to Done on the User-Smalyshev board.May 2 2019, 6:58 AM
Smalyshev closed this task as Resolved.May 8 2019, 5:58 AM

The updater now uses revision IDs everywhere for non-lagged fetches.

So, I did some really crappy analysis of the hit rate in varnish before and after this change, looking at the 5th of April and the 5th of May (one before and one after the change, as far as I can tell).

SUMMARY       April 5th    May 5th
hit-front          1132    2986499
hit-local          3689    6068535
hit-remote            1     593755
int-local             6          0
int-remote            2          0
miss           14126950    1053539
pass               4360       8624
TOTAL          14136140   10710952

These include requests from IPs that start with 10.*.

For the 2 days looked at, the hit rate for varnish has gone from 0.03% up to 90%, woo!
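Deriving those percentages from the summary counts above (treating hit-front, hit-local, and hit-remote as cache hits, and everything else as reaching the backend):

```python
def hit_rate(counts):
    """Varnish hit rate: hits at any cache layer divided by total requests."""
    hits = counts["hit-front"] + counts["hit-local"] + counts["hit-remote"]
    return hits / sum(counts.values())

april = {"hit-front": 1132, "hit-local": 3689, "hit-remote": 1,
         "int-local": 6, "int-remote": 2, "miss": 14126950, "pass": 4360}
may = {"hit-front": 2986499, "hit-local": 6068535, "hit-remote": 593755,
       "int-local": 0, "int-remote": 0, "miss": 1053539, "pass": 8624}

assert round(100 * hit_rate(april), 2) == 0.03   # ~0.03% before the change
assert round(100 * hit_rate(may)) == 90          # ~90% after the change
```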

For reference, I got the raw data with:

SELECT
  month, cache_status, COUNT(*) AS requests, ip
FROM wmf.webrequest
WHERE uri_host = 'www.wikidata.org'
  AND year = 2019
  AND ( month = 04 OR month = 05 )
  AND day = 05
  AND user_agent = 'Wikidata Query Service Updater'
  AND uri_path RLIKE '^/wiki/Special:EntityData'
GROUP BY month, cache_status, ip
ORDER BY month, cache_status, requests DESC
LIMIT 9999

Will this change also get rolled out to 3rd parties using the updater? / Is it in a certain release?

No release yet, but if you check out Updater or WDQS build, you get the same behavior.

I guess this will eventually be in wdqs 0.3.3?

Eventually, yes.