
[Story] Make Special:EntityData be up to date after an edit
Closed, Resolved (Public)

Description

When an entity is edited users expect the data they get via Special:EntityData to change as well. We need to purge the caches there after an edit.

The approach decided during the task inspection was to "just not cache these specific page requests" (T128486#6376066).

Questions from story time:

How quickly do we want Special:EntityData to have the latest data?

  • As close to immediately as possible without making it a month long project.

Is this specific for JSON only?

  • JSON is the most important one (the one people talk about the most), but all formats would be good

Is this for all users or just the user that made the edit?

  • All users

Points:

  • This task only talks about calls to this page that do not specify a concrete revision ID
  • Task inspection notes from 11 August 2020 T128486#6376066


Event Timeline

Does TitleSquidURLs require the full list?
Because if we have something like https://www.wikidata.org/wiki/Special:EntityData/Q3361378.ttl?flavor=dump, then the ttl part can be any of a bunch of formats, and so can the flavor.

There's EntityDataRequestHandler::purgeWebCache, which is supposed to do the purging, and it uses EntityDataUriManager::getCacheableUrls, but I don't see whether it handles flavors.
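
For context, a rough reconstruction of how that purge path fits together. Only the two method names above are real Wikibase names; the purgeUrls() collaborator is an assumption standing in for MediaWiki's CDN purge machinery:

// Hypothetical sketch, not the actual Wikibase code.
function purgeEntityDataFromCdn( $uriManager, $cdnPurger, $entityId ) {
	// Enumerate every URL variant the CDN may be holding for this entity.
	// If this list covers the formats but not the flavors (?flavor=dump
	// etc.), the flavored variants survive the purge, which would explain
	// the behaviour tested in the next comment.
	$urls = $uriManager->getCacheableUrls( $entityId );

	// Ask the CDN layer (Varnish/ATS) to drop each of those URLs.
	$cdnPurger->purgeUrls( $urls );
}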

I did some quick testing, and it looks like action=purge does indeed not purge URLs like https://www.wikidata.org/wiki/Special:EntityData/Q4115189.ttl?flavor=dump. This looks like an independent bug.

If we cache them we should purge them. But I'm worried about the performance implications of sending a purge for every possible combination of parameters.

I hear the new Varnish version (was it 4?) allows you to put multiple "variants" of a URL into a single "bucket". That would help.

Yeah... you'd have to purge each variant of the ttl and flavor parameters individually, at least right now.

Lydia_Pintscher moved this task from incoming to ready to go on the Wikidata board.

Adding to the general Wikidata Bridge board, since this means users may see stale data when starting an edit (even though the page content will typically have the fresh data). We just discovered this on Beta:

Screenshot from 2019-10-01 16-05-32.png (698×723 px, 94 KB)

The Twitter hashtag value on Beta Wikidata was changed from WikidataCon to Wikidata; the infobox has the new value (was automatically updated through change dispatching), but the bridge dialog loaded the entity data via the special page and got a stale value.

(The termbox also loads Special:EntityData, but doesn't have this problem, because it always requests the data for the mw.config.get( 'wgRevisionId' ) revision since T215786.)

Crazy idea suggested by a pragmatic fellow programmer: why don't we simply use our API if we don't want stale information (at least as a workaround)?

That’s a possible workaround, of course, but it causes additional network traffic and server load.

It would be great to be able to use the cached special entity page when we want the bridge to work at scale, to avoid increased network and server load.
But for now, in an MVP, I see no reason we can't use the uncached wbgetentities API?
Or alternatively call an API to initially look up the latest revid, then call the possibly cached special entity data page (but that's more work; see the sketch below).
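
For illustration, the two-step variant spelled out as a standalone PHP client (arbitrary entity ID, error handling omitted; the action API query is real, the overall flow is just the idea above made concrete):

<?php
// "Look up the revid first" workaround: one small, uncached API request,
// then a revisioned Special:EntityData request that Varnish/ATS can
// safely cache, because a concrete revision never changes.
$id = 'Q4115189';

// Step 1: ask the action API for the latest revision ID.
$api = 'https://www.wikidata.org/w/api.php?action=query&format=json'
	. '&prop=revisions&rvprop=ids&titles=' . $id;
$data = json_decode( file_get_contents( $api ), true );
$page = reset( $data['query']['pages'] );
$revId = $page['revisions'][0]['revid'];

// Step 2: fetch the entity data for that exact, immutable revision.
$entity = json_decode( file_get_contents(
	"https://www.wikidata.org/wiki/Special:EntityData/$id.json?revision=$revId"
), true );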

Two notes:

  • On Special:EntityData, anything that has a query argument just passes through Varnish/ATS. So format=foo, revid=666, etc. all see the uncached version.
  • We support too many formats, but why not just invalidate the cache for ttl and json for now, i.e. for Special:EntityData/Q666.json and Special:EntityData/Q666.ttl? It's just two lines of code.

On Special:EntityData, anything that has a query argument just passes through Varnish/ATS. So format=foo, revid=666, etc. all see the uncached version.

Are you sure about this? Because basically all of T217897: Reduce / remove the aggressive cache busting behaviour of wdqs-updater hinges on the fact that requests for a specific revision can be cached in Varnish (we didn't use ATS at the time).

And I’m getting a cache hit:

$ curl -svo/dev/null https://www.wikidata.org/wiki/Special:EntityData/Q1.ttl?revision=1116941900 2>&1 | grep -i '^< x-cache'
< x-cache: cp3052 miss, cp3052 hit/3
< x-cache-status: hit-front

We support too many formats, but why not just invalidate the cache for ttl and json for now, i.e. for Special:EntityData/Q666.json and Special:EntityData/Q666.ttl? It's just two lines of code.

Well, I’d prefer to “do the right thing”, and not just invalidate the formats that happen to be requested most often. If we can afford to do this.

If we cache them we should purge them. But I'm worried about the performance implications of sending a purge for every possible combination of parameters.

So how many combinations do we actually have?

I think that’s all the variables (5 formats times 4 flavour variants), so that’s 20 URLs we would need to purge. Is that enough to be a problem?
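
For illustration, that fan-out enumerated in a few lines of PHP. The 5-formats-times-4-flavours count also appears in the task inspection notes further down; the concrete format and flavour names here are assumptions, not an authoritative list of what Special:EntityData supports:

<?php
// Rough count of the purge fan-out per edit:
// 5 formats * 4 flavour variants (including "no flavor") = 20 URLs.
$formats = [ 'json', 'ttl', 'nt', 'rdf', 'php' ]; // illustrative
$flavors = [ null, 'dump', 'simple', 'full' ];    // illustrative

$urls = [];
foreach ( $formats as $format ) {
	foreach ( $flavors as $flavor ) {
		$url = "https://www.wikidata.org/wiki/Special:EntityData/Q1.$format";
		if ( $flavor !== null ) {
			$url .= "?flavor=$flavor";
		}
		$urls[] = $url;
	}
}
echo count( $urls ), "\n"; // 20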

WDoranWMF added subscribers: darthmon_wmde, WDoranWMF.

Untagging now based on discussion with @darthmon_wmde, please retag us if needed.

In order to prepare this for a story time, I think we need to get rid of the idea that purging is the solution, and instead start back at the desired behaviour from a user point of view.
A BDD or two might help here.

Things that need to be considered:

  • Are we talking about both logged in and logged out users?
  • Are we talking about requests to Special:EntityData/Q70, for example, or also (or only) Special:EntityData/Q70.json?

I think then, for this story, we might be able to make a decision about what to do technically without needing to do a purge, for example.
Right now in production the WMF is trying to do fewer purges, rather than more.
A result of that is also that we are not really in a position to push forward with any xkey cache purging.

  • Are we talking about both logged in and logged out users?

Both.

  • Are we talking about requests to Special:EntityData/Q70, for example, or also (or only) Special:EntityData/Q70.json?

The requests I've seen were for .json, I believe. It feels icky though to have different behavior for the different export formats.

So from IRC: the share of requests to Special:EntityData that we are talking about here is around 3.7%, i.e. requests that currently hit the cache when we might not want them to.

image.png (103×493 px, 52 KB)

Raw data @ https://docs.google.com/spreadsheets/d/1SIS5_Ch4JOj_9Fqi0JdmYcCnInYcOV-thhpRDRH9MVU/edit?usp=sharing
Generated with: P11066

One option might be to just not cache these pages in the first place.
This would result in ~800 more Varnish cache misses per minute for the special page.
That is nothing in comparison to the 10k requests per minute that are currently not cached for the query service updater.

I can check with ops whether this would be more desirable than extra purges (I believe it will be).

Addshore renamed this task from [Story] Purge Special:EntityData JSON after edit to [Story] Make Special:EntityData be up to date after an edit. (Aug 11 2020, 12:47 PM)
Addshore updated the task description.

Task inspection notes:

Possible approaches:

  1. Don't cache these requests << Decided as the approach to try
  2. Invalidate the cache on the edit
    • How expensive is doing the cache invalidation on edits?
    • How many cache invalidations will occur after 1 edit?
      • 20 cache invalidations (5 formats, 4 flavours)
      • 20 invalidations * 1000 edits = potentially 20k invalidations a minute?
    • Cache invalidation has to happen at multiple edge cache sites which = more time etc
  3. Make the requests without the revision ID a temporary redirect to the page with a revision ID << Decided as the 2nd-place choice if we have to reevaluate later (see the sketch after this list)
    • Would this mean we don't cache the requests to the page with no revision ID, always send an up-to-date redirect, and point to a possibly cached page with a revision ID? (YES)
    • Could potentially be a breaking change, depending on how users make their API requests?
  4. Don't cache the less used formats, do cache the more used formats and send purge requests (a combination of 1 and 2)
    • The motivation of this would be to send fewer purges on every edit than number 2
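
For illustration, a minimal sketch of what approach 3 could look like, as plain PHP with a stubbed revision lookup (none of this is actual Wikibase code):

<?php
// Approach 3: answer a revision-less request with an uncached 303
// redirect to the revisioned URL, which is immutable and can then be
// cached aggressively by the edge caches.
function lookupLatestRevisionId( string $entityId ): int {
	// Stub: in reality this would be a cheap "latest revision" lookup.
	return 1234;
}

$entityId = 'Q70'; // would come from the request path
$revId = lookupLatestRevisionId( $entityId );

// The redirect itself must never be cached, or it would go stale too.
header( 'Cache-Control: private, s-maxage=0, max-age=0' );
header(
	"Location: /wiki/Special:EntityData/$entityId.json?revision=$revId",
	true,
	303 // "See Other": temporary, so clients keep asking the original URL
);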

One interesting thing is that the current code seems to expect the opposite situation of what we now want to introduce:

EntityDataRequestHandler::outputData()
//FIXME: do not cache if revision was requested explicitly!
$maxAge = $request->getInt( 'maxage', $this->maxAge );
$sMaxAge = $request->getInt( 'smaxage', $this->maxAge );

// XXX: do we want public caching even for data from old revisions?
$maxAge  = max( self::MINIMUM_MAX_AGE, min( self::MAXIMUM_MAX_AGE, $maxAge ) );
$sMaxAge = max( self::MINIMUM_MAX_AGE, min( self::MAXIMUM_MAX_AGE, $sMaxAge ) );

At the time this was written (2013: I1dabe79261, I7298de0b9d), the expectation apparently was that eventually, we should only cache Special:EntityData requests without a revision ID, not ones with a revision ID.

I have a feeling that the original code didn't take into account how varnish/ATS works (or it worked differently back then)
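
In plain PHP terms, the direction the change below takes is roughly this (a minimal sketch of the idea, not the actual diff):

<?php
// "Cache Special:EntityData only if revision supplied": a concrete
// revision is immutable, so shared caches may keep it; "latest" data
// has to reflect edits immediately, so the CDN must not.
$revision = (int)( $_GET['revision'] ?? 0 );
$sMaxAge = 3600; // illustrative value

if ( $revision > 0 ) {
	header( "Cache-Control: public, s-maxage=$sMaxAge" );
} else {
	header( 'Cache-Control: private, s-maxage=0, max-age=0' );
}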

Change 620318 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[mediawiki/extensions/Wikibase@master] Cache Special:EntityData only if revision supplied

https://gerrit.wikimedia.org/r/620318

Change 620318 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Cache Special:EntityData only if revision supplied

https://gerrit.wikimedia.org/r/620318