
Set-up Citoid behind RESTBase
Closed, ResolvedPublic0 Estimated Story Points

Description

The contents of results returned by Citoid are good candidates for storing in RESTBase as they are (mostly) deterministic and consequently immutable. Moreover, that would speed up obtaining results for previous Citoid queries considerably.

See T103811: Public API endpoints for new services for the proposed API endpoints. The workflow would be the following:

  1. The client issues a request to https://{domain}/api/rest_v1/data/citation/{format}/{query}
  2. RESTBase checks whether the result is available in storage using (language,format,query) as the look-up key; if found, return the stored response
  3. Issue a request to Citoid
  4. Store the response using (language,format,query) as the key
  5. Return the response
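The five steps above amount to a read-through cache. A minimal sketch, with `storage` and `citoid_request` as illustrative stand-ins for RESTBase's table storage and the back-end Citoid call (these names are not actual RESTBase APIs):

```python
# Read-through cache sketch of the workflow above.
# `storage` is any dict-like key/value store; `citoid_request` is a
# hypothetical callable that performs the back-end Citoid request.

def get_citation(storage, citoid_request, language, fmt, query):
    key = (language, fmt, query)
    cached = storage.get(key)        # step 2: look up (language, format, query)
    if cached is not None:
        return cached                # found in storage: return stored response
    response = citoid_request(language, fmt, query)  # step 3: ask Citoid
    storage[key] = response          # step 4: store under the same key
    return response                  # step 5: return the response
```

A repeated query then hits storage instead of Citoid, which is the speed-up the description refers to.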

Action Items

  • Set up the public endpoints in RESTBase
  • Hook-up the necessary back-end modules in RESTBase
  • Move to using the public endpoint in the Citoid extension (Deployed by Jan 20).
  • Update documentation


Event Timeline

mobrovac raised the priority of this task to High.
mobrovac updated the task description. (Show Details)

Store the response using (language,format,query) as the key

Would it make sense to share citations globally whenever the content does not vary by Accept-Language? Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

If we had a way to reliably figure out if a site is not localized (likely the vast majority of sites, especially those that don't redirect), then we could use the global result from storage without repeating requests for each language.

Would it make sense to share citations globally whenever the content does not vary by Accept-Language?

Yes, definitely. Regardless of the domain the request came from, Citoid's result will depend only on the Accept-Language header.

It's the reason for the second action item. Originally I thought POST storage logic would suffice, but global citation-sharing slightly complicates things.

Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

That's a good question. I'd tend to say yes for the most part, but @Mvolz would probably have more info on that.

If we had a way to reliably figure out if a site is not localized (likely the majority of sites),

That's the core problem - how to quickly identify citation duplicates.

then we could use the global result from storage without repeating requests for each language.

Yup, that would be the end goal, as that would greatly increase the chance of storage hits.

Store the response using (language,format,query) as the key

Would it make sense to share citations globally whenever the content does not vary by Accept-Language? Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

We don't store Content-Language at all, and since response headers tend to be fairly unpredictable, relying on it would be a bad idea. But since we know the requested language, that's good enough; we just don't currently have a way of knowing whether the response varies by it or not. So each language wiki will have its own set of pages.

I'm more interested in other kinds of de-duplication here than language, though; the requested resource might have many things pointing to it: PMID, PMCID, DOI, URLs that differ only by parameters (e.g. Google URLs often have a user string in them). I don't know enough about RESTBase to determine whether any of this kind of de-duplication is possible with it, though. (Or if it's possible to index by parameters not in the initial search params.)
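The URL-parameter case can be sketched as normalizing away known per-user query parameters before keying. The parameter list below is purely illustrative, not an actual Citoid or RESTBase mechanism; a real deployment would need a vetted, per-site list:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of per-user/tracking query parameters.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url):
    """Drop known per-user query parameters so equivalent URLs share one key."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    # Rebuild the URL without the dropped parameters (and without a fragment).
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Two URLs differing only in such parameters would then map to the same storage key.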

So each language wiki will have its own set of pages.

If we store this per project, then that's a lot fewer cache hits. Do you think it would be safe to share responses if all of this is true?

  • the request was not redirected
  • no Content-Language header was set
  • no Vary: Accept-Language header is set in the response
  • the response is cacheable (no no-cache or private in Cache-control, to a first approximation)
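Taken together, the four conditions could be checked roughly like this. This is a sketch only; the header semantics are a first approximation, as the list itself says, and none of this is RESTBase code:

```python
# Sketch of the four shareability checks above, applied to a response.
# `headers` is a dict with lower-cased header names.

def is_globally_shareable(was_redirected, headers):
    cache_control = headers.get("cache-control", "").lower()
    vary = headers.get("vary", "").lower()
    return (not was_redirected                        # request was not redirected
            and "content-language" not in headers     # no Content-Language header
            and "accept-language" not in vary         # no Vary: Accept-Language
            and "no-cache" not in cache_control       # cacheable, to a first
            and "private" not in cache_control)       # approximation
```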

If we store this per project, then that's a lot fewer cache hits. Do you think it would be safe to share responses if all of this is true?

  • the request was not redirected

I'd say we can de-duplicate based on the final URL. Requests without redirects are a rare minority; most requests go through a sea of redirects.
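De-duplicating on the final URL amounts to collapsing the redirect chain before keying. A toy sketch over a precomputed redirect map (real code would follow actual HTTP redirects; the map here is just for illustration):

```python
def resolve_final_url(url, redirects):
    """Follow a redirect map until a URL with no further redirect is reached.

    `redirects` maps a URL to the URL it redirects to; cycles are broken
    by remembering which URLs we have already visited.
    """
    seen = set()
    while url in redirects and url not in seen:
        seen.add(url)
        url = redirects[url]
    return url
```

Keying storage on `resolve_final_url(...)` would let every redirecting alias of a resource share one stored citation.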

  • no Content-Language header was set
  • no Vary: Accept-Language header is set in the response

These are problematic since they may or may not be present in the response regardless of whether the language can actually be changed.

  • the response is cacheable (no no-cache or private in Cache-control, to a first approximation)

This is true for most resources, AFAIK.

I'm more interested in other kinds of de-duplication here rather than language though; the requested resource might have many things pointing to it, pmid, pmcid, doi, urls that differ by parameter

Most definitely, but as you explain, this is not trivial at all, so I'm thinking we should leave it for a second iteration.

I don't know enough about restbase to determine if any of this kind of deduplication is possible with it though. (Or if it's possible to index by parameters not in the initial search params.)

It's increasingly clear that we'll need to build a custom RESTBase module for Citoid, so yes, we can shuffle the request query params around at will.

As a first step I propose to set up a simple proxying endpoint in RESTBase under https://{domain}/api/rest_v1/data/citation/{format}/{query} and figure out the storage/sharing later.

As a first step I propose to set up a simple proxying endpoint in RESTBase under https://{domain}/api/rest_v1/data/citation/{format}/{query} and figure out the storage/sharing later.

Agreed, let's start with the API. CDN caching is of low utility for really low-volume use cases like citations, so having this entry point per-project is fine. If we decide to share storage across projects later, we can still do so behind the scenes.

One issue @Mvolz brought up is that citoid currently supports another basefields flag that changes the format to include additional fields. @Mvolz, how is this currently used (who uses it, how often, which combinations), and do you see any issue with folding this into the format parameter?

One issue @Mvolz brought up is that citoid currently supports another basefields flag that changes the format to include additional fields. @Mvolz, how is this currently used (who uses it, how often, which combinations), and do you see any issue with folding this into the format parameter?

Answered in the PR, but currently the basefields param is only valid for the mediawiki format, so we could easily roll that into format.

Moving to blocked until the basefields question is resolved.

Change 324911 had a related patch set uploaded (by Mobrovac):
RESTBase: Add the Citoid host/port combo to the config

https://gerrit.wikimedia.org/r/324911

Moving to blocked until the basefields question is resolved.

We have decided on the PR that we shall use a new format for this and keep it alive until we fold it entirely into the mediawiki format.
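Folding basefields into format could work by mapping a combined format name back onto Citoid's two parameters. The combined name `mediawiki-basefields` below is hypothetical, not necessarily the one chosen on the PR:

```python
# Hypothetical mapping from a single `format` value to Citoid's
# (format, basefields) pair; the combined name is illustrative only.

def split_format(fmt):
    if fmt == "mediawiki-basefields":
        return "mediawiki", True
    return fmt, False
```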

faidon subscribed.

Citoid is already fronted by our Traffic infrastructure (Varnish), a layer capable of caching, with cache hits there being more beneficial than caching in RESTBase due to geographic distribution, among other reasons.

Has this been considered and were the pros/cons of additionally caching in RESTBase evaluated? Even if so, it'd be interesting to hear about this. Thanks!

Change 324911 merged by Giuseppe Lavagetto:
RESTBase: Add the Citoid host/port combo to the config

https://gerrit.wikimedia.org/r/324911

Change 324926 had a related patch set uploaded (by Mobrovac):
RESTBase: Fix the Citoid URI for BetaCluster

https://gerrit.wikimedia.org/r/324926

Change 324926 merged by Elukey:
RESTBase: Fix the Citoid URI for BetaCluster

https://gerrit.wikimedia.org/r/324926

@faidon: Request rates are very low, and request patterns have a wide spread. By the time a given URL or DOI is requested again, the response will very likely have fallen out of Varnish. There are also overheads from DNS resolution for the separate domain.

In any case, performance is only one of several reasons for integrating Citoid into the REST API. From my perspective, the perhaps more important reason is to improve our public API product by documenting the Citoid endpoint as part of the wider API. This makes the functionality more discoverable, pushes us to clean up the API (see the basefields discussion), and lets us make progress on T133001.

@Mvolz would you mind looking into making the Citoid extension use the RESTBase endpoint instead of placing a call to citoid.wikimedia.org?

With T152221 resolved it seems that only T152220 remains until we can call this done. @Esanders, @mobrovac, @Jdforrester-WMF, could you take a look at @Mvolz's patch at https://gerrit.wikimedia.org/r/#/c/327244/ ?

Mvolz updated the task description. (Show Details)
mobrovac claimed this task.

This can now be considered done. There is still T133001: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames to deal with, though.