
[Investigation - 8h] technical overview of current db based Wikibase federation & blockers to get to an API based federation
Closed, Resolved · Public

Description

So far we have made it possible to re-use Wikidata's items and properties inside Wikimedia to make statements.
That's one type of federation.
It will be used for example to make statements on Commons using Wikidata properties and items.
It is not currently used anywhere else.
For reference, this federation is not currently enabled in production; it is about to be enabled on the beta cluster (T204748).

The same feature is very much in demand by users outside Wikimedia. We should make this possible.

Part of this task:

  • get an overview of the current implementation
  • make suggestions for how to expand this for use outside Wikimedia and identify blockers/hurdles


Event Timeline

Addshore triaged this task as Medium priority. Nov 20 2018, 7:59 AM
Michael renamed this task from [Investigation] technical overview of Wikimedia-internal federation to [Investigation - 8h] technical overview of Wikimedia-internal federation. Nov 20 2018, 2:42 PM
Addshore renamed this task from [Investigation - 8h] technical overview of Wikimedia-internal federation to [Investigation - 8h] technical overview of current db based Wikibase federation & blockers to get to an API based federation. Nov 20 2018, 3:04 PM
Addshore updated the task description.
Addshore updated the task description.

As far as I’m aware, our current federation is not only restricted by the requirement of database access, but also by the fact that each entity type is bound to one repository, and it is not possible to mix entities of the same type from multiple repositories. I suspect many external users will also require that restriction to be removed or relaxed (typically to use local and Wikidata items together); is that part of this task as well?

As far as I’m aware, our current federation is not only restricted by the requirement of database access, but also by the fact that each entity type is bound to one repository, and it is not possible to mix entities of the same type from multiple repositories.

Correct.

I suspect many external users will also require that restriction to be removed or relaxed (typically to use local and Wikidata items together); is that part of this task as well?

Hmm, arguably this is a separate question, and also of a different kind. How to deal with, say, items from three different places may become a more "fundamental" conceptual question, whereas the limitation of the current (admittedly limited) version of federation requiring shared DB access with Wikidata production seems to be "only" a technical problem. So I would see arguments for keeping those questions separate.

That said, to be clear, I think the other question seems more relevant for the non-Wikimedia federation uses I could imagine or have heard about, so it is without a doubt an important question. It might deserve its own investigation, or a series of them.

Yeah. It's definitely something that will come up. But let's take it one step at a time. As the next step, not mixing entities from several repositories for the same entity type seems good enough.

What is needed to make MultipleRepositoryAwareWikibaseServices use an API?

Case Study: PropertyInfoLookup

  • Interface: \Wikibase\Lib\Store\PropertyInfoLookup
  • 3 relevant implementations:
    • CachingPropertyInfoLookup
    • DispatchingPropertyInfoLookup
    • PropertyInfoTable
  • Only PropertyInfoTable actually looks up data; the other two each require an inner lookup themselves (CachingPropertyInfoLookup wraps one, DispatchingPropertyInfoLookup dispatches to several)
  • MultiRepositoryWiring.php creates a DispatchingPropertyInfoLookup and supplies it with an array mapping repo names to service instances
  • => possible general approach: add an HttpPropertyInfoLookup or something similar for remote Wikibase instances (see the sketch after this list)
    • there needs to be an API endpoint available for that service to call
    • how do we handle a service that is unreachable?
    • think about caching
    • "push notifications" for changed data?

What is already configured?

What further configuration may be necessary?

  • api url
  • api keys? user/pass?
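
To make this concrete, here is a hypothetical sketch of what such settings could look like; the apiUrl, apiKey, username and password keys are invented, and whether this would hang off the existing foreign-repository configuration or something new is an open question.

```php
// Hypothetical per-repository federation settings; none of these keys
// exist yet, they only illustrate the configuration discussed above.
$wgWBClientSettings['foreignRepositories'] = [
	'wikidata' => [
		// explicit API endpoint, rather than deriving it from the entity URL
		'apiUrl' => 'https://www.wikidata.org/w/api.php',
		// credentials, in case anonymous requests turn out not to be enough
		'apiKey' => null,
		'username' => null,
		'password' => null,
	],
];
```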

Further considerations

  • blocking/blacklisting Wikibase installations that want to federate
    • option to whitelist them instead?
  • how to handle multiple Wikibase installations containing the same type of entities - e.g. Items or Properties?
  • tracking requests from federated Wikibase installations

Next Steps

  • Decide:
    • Option 1: most Services get a HTTP variant
      • Think about what API they need and what we already have
      • Think about caching (see the sketch after this list).
      • Query 25 federated instances on every keystroke that expects auto-complete?
    • Option 2: A local replica of the relevant tables of the federated databases?
      • Pro: Could mostly be used with existing services, fast
      • Con: Could be very resource intensive independent of actual usage
    • Option 3: A mix of the above?
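
For Option 1, the existing caching decorator could blunt the per-keystroke cost, roughly as sketched below. HttpPropertyInfoLookup is the hypothetical class sketched earlier, and the exact CachingPropertyInfoLookup constructor signature may differ from what is shown.

```php
// Option 1 sketch: wrap the hypothetical HTTP lookup in the existing
// caching decorator so repeated look-ups don't hit the remote repo.
$lookup = new CachingPropertyInfoLookup(
	new HttpPropertyInfoLookup( 'https://www.wikidata.org/w/api.php' ),
	wfGetMainCache(),
	3600 // cache remote property info for an hour
);
```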

I don’t understand this part:

  • blocking/blacklisting Wikibase installations that want to federate
    • option to whitelist them instead?

The federation setup is part of the site configuration (I assume), so where would a blacklist or whitelist apply? If the site admin doesn’t want federation with a particular Wikibase instance, they just shouldn’t configure it. Or is this a blacklist/whitelist on the target repo, so that e. g. Wikidata would refuse to be a foreign repo for Metapedia?

[...] Or is this a blacklist/whitelist on the target repo, so that e. g. Wikidata would refuse to be a foreign repo for Metapedia?

Yes, that was the usage I thought of. How do we handle badly configured instances that target our current instance? Rate limiting might be another possible way to go here. OTOH, it might be enough to handle this with the existing anti-DOS measures.

Hm, I don’t think that’s something we need to deal with at the Wikibase level… I feel like this is something that’s best dealt with at the HTTP proxy level. Apart from the other repo’s IP address (that the request is coming from), the current instance doesn’t necessarily know anything about the other repo anyways, does it?

Some more responses, then:

how to handle multiple Wikibase installations containing the same type of entities - e.g. Items or Properties?

Not in scope for this task, if I understand correctly (see also T209880#4778278).

api url

To clarify: the way I currently envision this, API-based federation would use the current Wikibase APIs (mostly wbgetentities, I assume). Is that also what you have in mind, or do you want to create some dedicated API or use something else?
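
For reference, a fetch through the existing module would look roughly like this; Q42 and the props selection are only an example.

```php
// Fetching an entity through the existing wbgetentities module:
$url = 'https://www.wikidata.org/w/api.php'
	. '?action=wbgetentities&ids=Q42&props=labels|descriptions|claims&format=json';
$data = json_decode( file_get_contents( $url ), true );
$entity = $data['entities']['Q42'];
```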

api keys? user/pass?

It’s probably enough to make all requests anonymously (I hope). (Authenticating the requests would make tracking easier, though.)

Query 25 federated instances on every keystroke that expects auto-complete?

Currently, we would only query one federated instance, because any entity type can only be provided by a single repository (the local one or a federated one). That will change in the future, but even then, I don’t think many installs will have more than two or three repos for the same entity type (local item, Wikidata item, perhaps something in between?).

Decide:

I would vote for option 1 (HTTP variants of services), but I’m not sure what the decision process for this is going to be anyways.

Currently, we would only query one federated instance, because any entity type can only be provided by a single repository (the local one or a federated one). That will change in the future, but even then, I don’t think many installs will have more than two or three repos for the same entity type (local item, Wikidata item, perhaps something in between?).

That depends on how fractured the ecosystem is going to be. If many GLAM organizations start hosting their own Wikibase but somewhat agree on a common entity type (say Artifact:A12345), then we might face a lot of instances. That might be especially true if we make spinning up instances very easy, e.g. by creating a "Wikibase hub" or something similar to the wordpress.com model for Wikibase, with GLAM defaults preconfigured. Third parties might also create something like wikia.com.

api url

With this line, I just wanted to point out that we currently have only the entity URL configured, e.g. https://wikidata.beta.wmflabs.org/entity/, but we may want to explicitly configure the URL for the API instead of "recalculating" it, since the two may differ significantly.

the way I currently envision this, API-based federation would use the current Wikibase APIs (mostly wbgetentities, I assume). Is that also what you have in mind, or do you want to create some dedicated API or use something else?

Honestly, I'm not sure and I think that depends on how we model the other components. If we can make use of all the information wbgetentities provides and it scales well, then that might be enough.


In general, I get the feeling that building a prototype for this and noting all the ways in which it fails might be a good way to get a reliable list of the things we have to build right. Maybe let's take a small team and a few weeks for a trailblaze on this?

api url

With this line, I just wanted to point out that we currently have only the entity URL configured, e.g. https://wikidata.beta.wmflabs.org/entity/, but we may want to explicitly configure the URL for the API instead of "recalculating" it, since the two may differ significantly.

I think there essentially needs to be an API endpoint for a repo to generate a federation config that can then be used on another repo, instead of people manually cobbling together various different URLs and configs etc.
The repo knows all of the settings needed to connect to it, so just have it do that. :)
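
A sketch of what such an endpoint might look like on the repo side; the module name and every field in the output are invented for illustration.

```php
// Hypothetical API module on the repo that exports the settings another
// wiki would need in order to federate with it; name and fields are
// assumptions, not an existing module.
class ApiWbFederationConfig extends ApiBase {

	public function execute() {
		$this->getResult()->addValue( null, 'federationConfig', [
			'conceptBaseUri' => 'http://www.wikidata.org/entity/',
			'apiUrl' => 'https://www.wikidata.org/w/api.php',
			'entityDataUrl' => 'https://www.wikidata.org/wiki/Special:EntityData/',
			'supportedEntityTypes' => [ 'item', 'property' ],
		] );
	}
}
```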

the way I currently envision this, API-based federation would use the current Wikibase APIs (mostly wbgetentities, I assume). Is that also what you have in mind, or do you want to create some dedicated API or use something else?

Honestly, I'm not sure and I think that depends on how we model the other components. If we can make use of all the information wbgetentities provides and it scales well, then that might be enough.

Right now we probably want to use Special:EntityData instead of wbgetentities, as the API currently has no Varnish-level caching, but Special:EntityData does.
We may also want a new API; I guess this depends on the granularity of the data updates.
If we want to provide more granularity, we add complexity; retrieving the whole entity / thing that has been updated is always going to be the easiest option.
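
For comparison, the Special:EntityData route is just a cacheable GET of the full entity; Q42 is only an example.

```php
// Special:EntityData output is served through the Varnish caching layer,
// so re-fetching an unchanged entity is cheap:
$json = file_get_contents(
	'https://www.wikidata.org/wiki/Special:EntityData/Q42.json'
);
$entity = json_decode( $json, true )['entities']['Q42'];
```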


In general, I get the feeling that building a prototype for this and noting all the ways in which it fails might be a good way to get a reliable list of the things we have to build right. Maybe let's take a small team and a few weeks for a trailblaze on this?

This is on the roadmap for next year as far as I know, so something like this will probably happen.

how to handle multiple Wikibase installations containing the same type of entities - e.g. Items or Properties?

Not in scope for this task, if I understand correctly (see also T209880#4778278).

Indeed, the initial version will not allow multiple repos for the same entity type.

api keys? user/pass?

It’s probably enough to make all requests anonymously (I hope). (Authenticating the requests would make tracking easier, though.)

In terms of tracking, I guess user agents that make sense would do.
However, for federation between other third-party Wikibases, they might desire some extra level of control that is not needed by Wikimedia itself.

Query 25 federated instances on every keystroke that expects auto-complete?

Currently, we would only query one federated instance, because any entity type can only be provided by a single repository (the local one or a federated one). That will change in the future, but even then, I don’t think many installs will have more than two or three repos for the same entity type (local item, Wikidata item, perhaps something in between?).

I think that is probably true; we will probably only be talking about a couple of federated repos.
Anyway, if a repo does want to federate with 20 repos, causing JS calls to all 20 of them etc., that's up to them, but I guess we won't design it to make that super nice initially.

Decide:

I would vote for option 1 (HTTP variants of services), but I’m not sure what the decision process for this is going to be anyways.

I would also vote for HTTP variants of services.