
Determine which API we should use to fetch Lexeme data from Wikidata when specified in the function-orchestrator
Closed, ResolvedPublic

Description

This task is a decision ticket for how the Wikifunctions function-orchestrator service will fetch Lexeme objects (JSON blobs) from Wikidata when needed. There are multiple APIs with different availability and quality characteristics, and we should select which one we're going to use.

Use Case
A user needs the proper plural of an English noun. For now, we assume that we know the LID, so we want to be able to get the JSON describing the Lexeme with a specific LID.

Questions to answer

  • Is there a way to register to be pinged if a Lexeme has been edited, so we can purge caches if necessary? Does this already exist for Commons and the Wikipedias?
  • Is there a way to register to be pinged if an Item has been edited? Is this the same method as for Lexemes?
  • Given that Wikifunctions runs a MediaWiki extension (WikiLambda), is there a way to arrange for WikiLambda to retrieve Wikidata content (lexemes in particular) without crossing a network boundary?
  • What API options are there? Which one would work best for this use case?

Steps:

  • Understanding of what APIs exist
  • Input from WMDE about which API(s) they recommend, and pros and cons
  • Selection of said API
  • Confirmation that said API is appropriately accessible/cached from the k8s cluster

Notes
Background reading: https://www.wikidata.org/wiki/Wikidata:Data_access

Event Timeline

Background reading: https://www.wikidata.org/wiki/Wikidata:Data_access

Our use case is: imagine a user who needs the proper plural of an English noun. For now we assume that we know the LID, so we want to be able to just get the JSON describing the Lexeme with a specific LID.

Do we use the MediaWiki API? Do we use the Linked Open Data endpoint? Some other API?

Is there a way to register somehow to be pinged if that Lexeme has been edited, so we can purge caches if necessary? I think you have something like that for Commons and the Wikipedias.

Same thing for Items and QIDs, if it is the same answer, but Lexemes are our focus.

In Wikidata:Data_access, it says:

The following URL formats are used by the user interface and by the query service updater, respectively, so if you use one of the same URL formats there’s a good chance you’ll get faster (cached) responses:

https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266 (JSON)

Does that mean we would have to specify a specific revision to get the benefit of caching?

Here's another important question from the Wikifunctions team perspective: Given that we are running a MediaWiki extension (WikiLambda), is there a way for us to arrange for WikiLambda to retrieve Wikidata content (lexemes in particular) without crossing a network boundary?

That's irrelevant for us, as these requests are coming from a network service (the function orchestrator) which runs on a different network, the service namespace on the main k8s cluster.

Actually, I raise this question in conjunction with considering the possibility of retrieving wikidata content directly from WikiLambda. It's a question that has been raised in the caching design document.

(Almost this entire answer applies to Lexemes, Items and Properties equally, so I’ll mostly just say “Entities” to cover them all.)

Do we use the MediaWiki API? Do we use the Linked Open Data endpoint? Some other API?

I think it would be best to use the Linked Open Data endpoint, i.e. Special:EntityData. (The action API, i.e. action=wbgetentities, gives you the same data format anyway; it lets you get multiple Entities at once, but if I understand the use case correctly, you won’t have more than one ID to look up at a time anyway.) Initially, you won’t know the latest revision ID of the Entity data to get (and I’m assuming the PHP side won’t know which Entity IDs will be looked up, so it can’t look up the revision IDs ahead of time), so you’ll get an uncached HTTP response; but I think during change dispatching (see below) you could probably have the revision ID, and then you could tweak the URL to use one of the cached formats. Using Special:EntityData in both cases should keep the code simpler.
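The two URL shapes discussed here (uncached latest-revision lookup vs. the revision-pinned form that matches what the query service updater requests) can be sketched roughly as follows. This is an illustrative helper, not orchestrator code; the function name is hypothetical.

```python
# Sketch of the two Special:EntityData URL forms discussed above.
BASE = "https://www.wikidata.org/wiki/Special:EntityData"

def entity_data_url(entity_id, revision=None):
    """Build a Special:EntityData JSON URL for an entity (Lexeme, Item, ...).

    Without a revision, the response reflects the latest data but is less
    likely to be cached; with a revision, the URL matches the format used
    by the query service updater, which has a better chance of a cache hit.
    """
    url = f"{BASE}/{entity_id}.json"
    if revision is not None:
        url += f"?revision={revision}"
    return url

# Initial lookup: latest revision ID unknown, so an uncached response is likely.
print(entity_data_url("L99"))
# Once change dispatching supplies a revision ID, the cacheable form can be used.
print(entity_data_url("Q42", revision=1600533266))
```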

Is there a way to register somehow to be pinged if that Lexeme has been edited, so we can purge caches if necessary? I think you have something like that for Commons and the Wikipedias.

Within Wikibase, we do this with change dispatching. Wikidata keeps track of which other wikis are subscribed to an Entity (this is shown in action=info), and then each wiki keeps track of which of its pages are subscribed to which aspects of which Entity (e.g. page X uses the English label, description, and “instance of” statement of item Y; this is also shown on action=info). When a change happens on Wikidata, it’s sent to each subscribed wiki, and the wiki then checks which of its pages are affected by the change (e.g. page X would not be updated if the “image” statement of item Y changed) and updates (re-renders) the affected pages. I think I’ve heard some rumors years ago about replacing this with a more general-purpose dependency tracking mechanism, but nothing concrete.

The most straightforward “pinging” approach would probably be to reuse this for Wikifunctions as well, and I don’t see any immediate problems with it. It’s a bit tricky that the Entity lookups will come from a network service (per T368654#9950064), which won’t directly have access to update Entity usage; I’m guessing there should either be an internal endpoint on the PHP side which the orchestrator can hit to add an Entity usage, or the orchestrator’s response should include all the Entities that were looked up as part of the metadata and the PHP code would add the Entity usage when processing the response. And we might need some hook to let WikiLambda do something custom rather than re-rendering a page.

I guess the biggest question there is: is the page-level granularity of the existing usage tracking mechanism a good fit for Wikifunctions? (I’m envisioning individual ZObject pages being registered as the subscribers of Wikibase Entities.) Or would you need something else?

(Aside: if you use Special:EntityData, you always get the full Entity data, but you wouldn’t necessarily access all of it; and preferably, if the function only accesses e.g. the English label of an Item, we would also want to only track that usage. We do this in Lua with some metatable magic, where we return the full data but then automatically track which members are actually accessed; something similar should be possible for Wikifunctions too, I assume, and then it would be the orchestrator’s responsibility to report which aspects of the Entity were really accessed. But that’s probably not a high priority for the moment, because Lexeme usage tracking is currently also not fine-grained at all – I kicked that can down the road in T235901, and it hasn’t come up again yet, so currently WikibaseLexeme Lua access always tracks “the whole Entity was used”.)

Thanks, @Lucas_Werkmeister_WMDE ! Extremely helpful. Work is proceeding on some orchestrator code to retrieve lexemes via the Linked data interface. It would be useful to know if there are any guarantees about the structure of a Lexeme returned as JSON. For example, is it certain there will always be an id, language, lexicalCategory, and at least one lemma? Can there be a lexeme with no forms? If there are no forms, will there still be a forms property with an empty list? If there are zero claims, will there still be a claims property with an empty object?

I'm assuming there are probably minimal guarantees, which we can work with of course, but if there's any additional info, it will be helpful. So far I haven't seen anything that looks like contractual language for the API, and I'm not sure where to look for a formal schema. Is there anything like that?

You are probably looking for https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model

There is always an ID, language, lexical category, and at least one Lemma.
There can be Lexemes without Forms and/or Senses.
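Given those guarantees (id, language, lexical category, and at least one lemma always present; forms and senses possibly absent), consumer code can read the guaranteed fields directly and be defensive about the rest. Whether a Lexeme with no forms serializes an empty `forms` key was not confirmed above, so this sketch tolerates both cases; the sample blob is illustrative only.

```python
# Defensive access to a Lexeme JSON blob, per the guarantees stated above.
# Field names follow the WikibaseLexeme data model; sample data is made up.
sample_lexeme = {
    "id": "L99",
    "language": "Q1860",
    "lexicalCategory": "Q1084",
    "lemmas": {"en": {"language": "en", "value": "noun"}},
    # "forms" may be missing entirely, or present as an empty list.
}

def get_forms(lexeme):
    # Tolerate both an absent key and an explicit empty list.
    return lexeme.get("forms", [])

# Guaranteed fields can be accessed without fallback handling.
assert sample_lexeme["id"] and sample_lexeme["lemmas"]
print(get_forms(sample_lexeme))  # → []
```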

Hi @Jdforrester-WMF, have you got all the info you need from us here?

Hi @Arian_Bozorg , I think we are likely to have occasional follow-up questions over the next 1-3 months, if that's okay.

Hi again @Lucas_Werkmeister_WMDE - Above you recommended the Linked Open Data endpoint, rather than the action API (action=wbgetentities). But in fact we are considering that at some point we might want to retrieve multiple entities in one call. So I'm wondering: if we revisit this choice, are there any factors to consider besides minimizing our code complexity? For example, does either of the 2 APIs generally give faster responses? Does either of them allow greater flexibility in the use of change dispatching? Do both of them allow to use the revision ID, if and when Wikifunctions has it? (And any other pertinent differences that you know of would be helpful.)

For example, does either of the 2 APIs generally give faster responses?

I wouldn’t expect them to; they do pretty much the same work. (There might theoretically be a systematic performance difference if we still use separate database and/or app server groups for index.php and api.php? No idea.) wbgetentities lets you filter the data returned using the props parameter, but that only reduces the amount of data sent over the wire and slightly reduces the CPU cost of adding the data to the response – the full Entity data is still loaded and deserialized first regardless of the props, so I assume it doesn’t make a huge difference. (Also, most of the props don’t apply to Lexemes anyway.) The biggest difference is going to be that Special:EntityData responses are potentially cached. (How often they’re actually cached might be worth investigating. TTL RDF responses for latest revisions should have a pretty high cache hit rate AFAIK, because they’re practically guaranteed to have been requested by the query service updater; for JSON responses it probably depends more on whether the Lexeme was recently visited by a human editor or not.)
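For comparison with the Special:EntityData URL form, a `wbgetentities` request with the `props` filter mentioned above might be assembled like this. The `action`, `ids`, and `props` parameters are standard action API parameters; the helper itself is a hypothetical sketch, not orchestrator code.

```python
# Sketch of an action=wbgetentities request URL with the props filter
# discussed above (which trims the response payload, though the full entity
# is still loaded server-side regardless).
from urllib.parse import urlencode

def wbgetentities_url(entity_ids, props=None):
    """Build a wbgetentities URL; multiple IDs are pipe-separated."""
    params = {
        "action": "wbgetentities",
        "format": "json",
        "ids": "|".join(entity_ids),
    }
    if props:
        params["props"] = "|".join(props)
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

print(wbgetentities_url(["L99", "L100"], props=["info", "claims"]))
```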

Does either of them allow greater flexibility in the use of change dispatching?

I don’t think so.

Do both of them allow to use the revision ID, if and when Wikifunctions has it?

No, only Special:EntityData allows that. wbgetentities always returns the data of the latest revision. (Except potentially in cases of DB replication lag, I suppose.)

(And any other pertinent differences that you know of would be helpful.)

wbgetentities has a few other parameters that can be useful in general (e.g. languagefallback), but probably not for your use case. I can’t think of anything else right now.

No problem, I'll leave the ticket open so we can discuss here

In Wikidata:Data_access, it says:

The following URL formats are used by the user interface and by the query service updater, respectively, so if you use one of the same URL formats there’s a good chance you’ll get faster (cached) responses:

https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266 (JSON)

Does that mean we would have to specify a specific revision to get the benefit of caching?

Special pages are not cached at the edge, so there is no caching for that URL, regardless of whether a revision is indicated:

$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate
$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate

If you want to use change dispatching from Wikidata, which is a proven mechanism, then you'd probably be better off keeping the lexeme data within the wiki structure and passing it to the orchestrator as a parameter of the function. That would allow you to re-parse the wiki page using the normal mechanism that's already established for wikis, and would solve a lot of problems for you (including, I think, how to fetch the items, but I'd let @LucasWerkmeister comment on that).

If this is not possible for some reason, then I'd consider taking a look at our Event Platform to see if it's possible to get events including new revision content for Wikidata as well. If you need specific data, you would probably benefit from the prior art of the WDQS Updater, which also guarantees not overwhelming Wikidata with requests.

Also take a look at the new Search Update Pipeline (Added @dcausse and @pfischer as subscribers.)

Just an FYI: Event Platform maintains a mediawiki.page_change.v1 event stream in Kafka, as well as a mediawiki.page_content_change.v1 stream which has (unparsed) wiki content in it. (I'd guess the content version you need is not the raw Wikidata content, though.)

Special pages are not cached at the edge, so there is no caching for that url, independently of indicating a revision or not:

$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate
$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate

Some caching seems to be happening:

$ curl -Is 'https://www.wikidata.org/wiki/Special:EntityData/Q42.json'  | grep x-cache
x-cache: cp3073 miss, cp3073 pass
x-cache-status: pass
$ curl -Is 'https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266'  | grep x-cache
x-cache: cp3073 miss, cp3073 hit/1
x-cache-status: hit-front

Yeah, that is slightly surprising; anyway, we won't allow the orchestrator to make calls to the edge, for security and reliability reasons. All requests will be mediated via the service mesh.

An update after the meeting with SRE, Security and the wikifunctions people:

  • All requests will need to be mediated via the service mesh. This means that your requests should be configurable as they are for calls to Wikifunctions, where you can set both the URL to call and the Host: header you will be sending. I see you already have a setting for WIKIDATA_API_URL; you would need to add an additional setting similar to wikifunctionsVirtualHost, and then use it to set the header.
  • Given there's no caching of results in the orchestrator, we need to limit the number of Wikidata API calls from a function quite restrictively. We should have a global configurable limit so that we don't create too much of a multiplication vector. The current practical limit, set by the fact that each function can make at most 1 call and that a composed function has a recursion limit of 100, is too high to be sustainable in production.
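The mesh-mediated request shape described above (configurable target URL plus a separately configurable Host: header, mirroring the existing wikifunctionsVirtualHost pattern) could be sketched like this. The environment variable names WIKIDATA_API_URL and WIKIDATA_VIRTUAL_HOST come from this discussion; the helper, the default values, and the localhost port are illustrative assumptions.

```python
# Sketch: build (url, headers) for a Wikidata call routed via the service
# mesh. Defaults here are placeholders, not real deployment values.
import os

def wikidata_request_params(path):
    """Return the mesh-local URL and the Host header for a Wikidata request."""
    base = os.environ.get("WIKIDATA_API_URL", "http://localhost:6500")
    host = os.environ.get("WIKIDATA_VIRTUAL_HOST", "www.wikidata.org")
    # The mesh endpoint receives the request; the Host header tells it
    # which virtual host the request is really destined for.
    return base.rstrip("/") + path, {"Host": host}

url, headers = wikidata_request_params("/w/api.php")
print(url, headers)
```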

Thanks for the guidance, @Joe ! These changes have now been made in MR !229 (renamed to Retrieve wikidataVirtualHost from env. and pass to ReferenceResolver). The new setting is called wikidataVirtualHost, and is handled identically to wikifunctionsVirtualHost. The corresponding environment variable is WIKIDATA_VIRTUAL_HOST.

Yeah that is slightly surprising

I guess your “slightly surprising” is our “pretty important for site performance” ;) I don’t know if we track the cache hit rate anywhere, but the total request volume to Special:EntityData is in the millions per day. If you’re not using that cache, then I agree you probably want a more restrictive limit on those calls 👍

What was slightly surprising was that I got non-caching headers the other day, not that we rightfully cache an immutable URL at the edge :)

But yes, internal applications should never use the CDN cache: first, their access patterns can be very different from anything else and could cause cache poisoning; second, it creates a double dependency on the CDN in change propagation.

Per team discussion, our current state resolves our needs at least for the next few months. To discuss further with SRE in our fortnightly meetings.

Change #1075554 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2024-08-13-135124 to 2024-09-24-145528

https://gerrit.wikimedia.org/r/1075554

Change #1075554 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2024-08-13-135124 to 2024-09-24-145528

https://gerrit.wikimedia.org/r/1075554

This comment is to notify everyone that Wikifunctions is planning to start using wbgetentities for some upcoming use cases, and also eventually to switch from Special:EntityData to wbgetentities for the current use case mentioned in the Description.

Reasons:

  • One anticipated use case involves fetching multiple lexemes at once (supported by wbgetentities).
  • Another use case involves fetching items, and we'd like to be able to use the props parameter of wbgetentities.
  • Switching from Special:EntityData to wbgetentities allows a useful merging of two different fetching procedures that we have.

@Lucas_Werkmeister_WMDE , based on your answers above, it's my understanding that it's fine for us to use either one, from the Wikidata perspective. Just thought it would be good for you and others to know that we are going to start using wbgetentities, and have a chance to make additional comments on that.

Our code that uses wbgetentities got deployed recently and is working well. Since there were no replies to my previous comment, I assume there were no concerns about Wikifunctions code adopting this approach. One note: fetching a single entity from Special:EntityData appears to be a bit faster than fetching a single entity from wbgetentities. This is just based on a small number of tests using curl commands, but the results didn't vary very much.
Resolving.