
Set-up Citoid behind RESTBase
Closed, ResolvedPublic0 Estimated Story Points

Description

The contents of results returned by Citoid are good candidates for storing in RESTBase as they are (mostly) deterministic and consequently immutable. Moreover, that would speed up obtaining results for previous Citoid queries considerably.

See T103811: Public API endpoints for new services for the proposed API endpoints. The workflow would be the following:

  1. The client issues a request to https://{domain}/api/rest_v1/data/citation/{format}/{query}
  2. RESTBase checks whether the result is available in storage using (language,format,query) as the look-up key; if found, return the stored response
  3. Issue a request to Citoid
  4. Store the response using (language,format,query) as the key
  5. Return the response
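The five steps above amount to a read-through cache. A minimal sketch, with `storage` and `citoid_request` as illustrative stand-ins for RESTBase's table storage and the back-end Citoid call (these names are not actual RESTBase APIs):

```python
# Read-through cache sketch of the workflow above.
# `storage` is any dict-like key/value store; `citoid_request` is a
# hypothetical callable that performs the back-end Citoid request.

def get_citation(storage, citoid_request, language, fmt, query):
    key = (language, fmt, query)
    cached = storage.get(key)        # step 2: look up (language, format, query)
    if cached is not None:
        return cached                # found in storage: return stored response
    response = citoid_request(language, fmt, query)  # step 3: ask Citoid
    storage[key] = response          # step 4: store under the same key
    return response                  # step 5: return the response
```

A repeated query then hits storage instead of Citoid, which is the speed-up the description refers to.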

Action Items

  • Set up the public endpoints in RESTBase
  • Hook-up the necessary back-end modules in RESTBase
  • Move to using the public endpoint in the Citoid extension (Deployed by Jan 20).
  • Update documentation


Event Timeline

mobrovac raised the priority of this task to High.
mobrovac updated the task description. (Show Details)

Store the response using (language,format,query) as the key

Would it make sense to share citations globally whenever the content does not vary by Accept-Language? Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

If we had a way to reliably figure out if a site is not localized (likely the vast majority of sites, especially those that don't redirect), then we could use the global result from storage without repeating requests for each language.

Would it make sense to share citations globally whenever the content does not vary by Accept-Language?

Yes, definitely. Regardless of the domain the request came from, Citoid's result will depend only on the Accept-Language header.

It's the reason for the second action item. Originally I thought POST storage logic would suffice, but global citation-sharing slightly complicates things.

Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

That's a good question. I'd tend to say yes for the most part, but @Mvolz would probably have more info on that.

If we had a way to reliably figure out if a site is not localized (likely the majority of sites),

That's the core problem - how to quickly identify citation duplicates.

then we could use the global result from storage without repeating requests for each language.

Yup, that would be the end goal, as that would greatly increase the chance of storage hits.

Store the response using (language,format,query) as the key

Would it make sense to share citations globally whenever the content does not vary by Accept-Language? Are sites that vary on Accept-Language reliably setting a Content-Language header in the response?

We don't store Content-Language at all, and since response headers tend to be fairly unpredictable, relying on it would be a bad idea. But since we know the requested language, that's good enough; we just don't currently have a way of knowing whether the response varies by it or not. So each language wiki will have its own set of pages.

I'm more interested in other kinds of de-duplication here than language, though; the requested resource might have many things pointing to it: PMID, PMCID, DOI, URLs that differ only by parameters (e.g. Google URLs often have a user string in them). I don't know enough about RESTBase to determine whether any of this kind of de-duplication is possible with it, though. (Or if it's possible to index by parameters not in the initial search params.)
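The URL-parameter case can be sketched as normalizing away known per-user query parameters before keying. The parameter list below is purely illustrative, not an actual Citoid or RESTBase mechanism; a real deployment would need a vetted, per-site list:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of per-user/tracking query parameters.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url):
    """Drop known per-user query parameters so equivalent URLs share one key."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    # Rebuild the URL without the dropped parameters (and without a fragment).
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Two URLs differing only in such parameters would then map to the same storage key.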

So each language wiki will have its own set of pages.

If we store this per project, then that's a lot fewer cache hits. Do you think it would be safe to share responses if all of this is true?

  • the request was not redirected
  • no Content-Language header was set
  • no Vary: Accept-Language header is set in the response
  • the response is cacheable (no no-cache or private in Cache-control, to a first approximation)
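Taken together, the four conditions could be checked roughly like this. This is a sketch only; the header semantics are a first approximation, as the list itself says, and none of this is RESTBase code:

```python
# Sketch of the four shareability checks above, applied to a response.
# `headers` is a dict with lower-cased header names.

def is_globally_shareable(was_redirected, headers):
    cache_control = headers.get("cache-control", "").lower()
    vary = headers.get("vary", "").lower()
    return (not was_redirected                        # request was not redirected
            and "content-language" not in headers     # no Content-Language header
            and "accept-language" not in vary         # no Vary: Accept-Language
            and "no-cache" not in cache_control       # cacheable, to a first
            and "private" not in cache_control)       # approximation
```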

If we store this per project, then that's a lot fewer cache hits. Do you think it would be safe to share responses if all of this is true?

  • the request was not redirected

I'd say we can de-duplicate based on the final URL. Requests without redirects are a rare minority; most requests go through a sea of redirects.
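De-duplicating on the final URL amounts to collapsing the redirect chain before keying. A toy sketch over a precomputed redirect map (real code would follow actual HTTP redirects; the map here is just for illustration):

```python
def resolve_final_url(url, redirects):
    """Follow a redirect map until a URL with no further redirect is reached.

    `redirects` maps a URL to the URL it redirects to; cycles are broken
    by remembering which URLs we have already visited.
    """
    seen = set()
    while url in redirects and url not in seen:
        seen.add(url)
        url = redirects[url]
    return url
```

Keying storage on `resolve_final_url(...)` would let every redirecting alias of a resource share one stored citation.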

  • no Content-Language header was set
  • no Vary: Accept-Language header is set in the response

These are problematic since they may or may not be present in the response regardless of whether the language can actually be changed.

  • the response is cacheable (no no-cache or private in Cache-control, to a first approximation)

This is true for most resources, AFAIK.

I'm more interested in other kinds of de-duplication here rather than language though; the requested resource might have many things pointing to it, pmid, pmcid, doi, urls that differ by parameter

Most definitely, but as you explain, this is not trivial at all, so I'm thinking we should leave it for a second iteration.

I don't know enough about restbase to determine if any of this kind of deduplication is possible with it though. (Or if it's possible to index by parameters not in the initial search params.)

It's increasingly clear that we'll need to build a custom RESTBase module for Citoid, so yes, we can shuffle the request query params around at will.

As a first step I propose to set up a simple proxying endpoint in RESTBase under https://{domain}/api/rest_v1/data/citation/{format}/{query} and figure out the storage/sharing later.

As a first step I propose to set up a simple proxying endpoint in RESTBase under https://{domain}/api/rest_v1/data/citation/{format}/{query} and figure out the storage/sharing later.

Agreed, let's start with the API. CDN caching is of low utility for really low-volume use cases like citations, so having this entry point per-project is fine. If we decide to share storage across projects later, we can still do so behind the scenes.

One issue @Mvolz brought up is that citoid currently supports another basefields flag that changes the format to include additional fields. @Mvolz, how is this currently used (who uses it, how often, which combinations), and do you see any issue with folding this into the format parameter?

One issue @Mvolz brought up is that citoid currently supports another basefields flag that changes the format to include additional fields. @Mvolz, how is this currently used (who uses it, how often, which combinations), and do you see any issue with folding this into the format parameter?

Answered in the PR, but currently the basefields param is only valid for the mediawiki format, so we could easily roll that into format.

Moving to blocked until the basefields question is resolved.

Change 324911 had a related patch set uploaded (by Mobrovac):
RESTBase: Add the Citoid host/port combo to the config

https://gerrit.wikimedia.org/r/324911

Moving to blocked until the basefields question is resolved.

We have decided on the PR that we shall use a new format for this and keep it alive until we fold it entirely into the mediawiki format.
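Folding basefields into format could work by mapping a combined format name back onto Citoid's two parameters. The combined name `mediawiki-basefields` below is hypothetical, not necessarily the one chosen on the PR:

```python
# Hypothetical mapping from a single `format` value to Citoid's
# (format, basefields) pair; the combined name is illustrative only.

def split_format(fmt):
    if fmt == "mediawiki-basefields":
        return "mediawiki", True
    return fmt, False
```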

faidon subscribed.

Citoid is already fronted by our Traffic infrastructure (Varnish), a layer capable of caching, with cache hits there being more beneficial than caching in RESTBase due to geographic distribution, among other reasons.

Has this been considered and were the pros/cons of additionally caching in RESTBase evaluated? Even if so, it'd be interesting to hear about this. Thanks!

Change 324911 merged by Giuseppe Lavagetto:
RESTBase: Add the Citoid host/port combo to the config

https://gerrit.wikimedia.org/r/324911

Change 324926 had a related patch set uploaded (by Mobrovac):
RESTBase: Fix the Citoid URI for BetaCluster

https://gerrit.wikimedia.org/r/324926

Change 324926 merged by Elukey:
RESTBase: Fix the Citoid URI for BetaCluster

https://gerrit.wikimedia.org/r/324926

@faidon: Request rates are very low, and request patterns have a wide spread. By the time a given URL or DOI is requested again, the response will very likely have fallen out of Varnish. There are also overheads from DNS resolution for the separate domain.

In any case, performance is only one of several reasons for integrating Citoid into the REST API. From my perspective, the perhaps more important reason is to improve our public API product by documenting the Citoid endpoint as part of the wider API. This makes the functionality more discoverable, pushes us to clean up the API (see the basefields discussion), and lets us make progress on T133001.

@Mvolz would you mind looking into making the Citoid extension use the RESTBase endpoint instead of placing a call to citoid.wikimedia.org?

With T152221 resolved it seems that only T152220 remains until we can call this done. @Esanders, @mobrovac, @Jdforrester-WMF, could you take a look at @Mvolz's patch at https://gerrit.wikimedia.org/r/#/c/327244/ ?

Mvolz updated the task description. (Show Details)
mobrovac claimed this task.

This can now be considered done. There is still T133001: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames to deal with, though.