
Consider providing an option to make a HEAD request to the target URL to confirm target availability and follow redirects
Open, Needs Triage, Public

Description

Web2Cit-Core should follow redirect statements in configuration files (T304772). This would be used to define domain aliases (e.g. www.example.com = example.com).

However, if one forgets to define such aliases explicitly, Web2Cit would fall back to using Citoid (see also T302019).

Additionally, consider providing an option to make a HEAD request to the target webpage beforehand and to use the domain indicated in any redirect response, instead of the domain of the original URL.
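A minimal sketch of what such an option could look like, assuming a runtime with the standard `fetch` API (Node 18+ or a browser). All names here (`pickDomain`, `fetchHead`, `resolveDomain`, the `followRedirects` flag) are hypothetical and not part of Web2Cit-Core's actual API. The HEAD request is injected as a function so that consumers can swap in their own HTTP client.

```typescript
// Hypothetical names throughout; none of this is Web2Cit-Core's actual API.

// Result of a HEAD request: the final URL after any redirects were followed.
type HeadResult = { finalUrl: string };
type HeadFetcher = (url: string) => Promise<HeadResult>;

// Pure helper: prefer the domain of the final URL, if one is known.
function pickDomain(originalUrl: string, finalUrl?: string): string {
  return new URL(finalUrl || originalUrl).hostname;
}

// Default fetcher: the standard fetch API follows redirects and exposes
// the final URL as `response.url`.
const fetchHead: HeadFetcher = async (url) => {
  const response = await fetch(url, { method: "HEAD", redirect: "follow" });
  return { finalUrl: response.url };
};

// Returns the domain to use when looking up configuration files.
// With the option disabled, this is simply the original URL's domain.
async function resolveDomain(
  originalUrl: string,
  followRedirects: boolean = false,
  headFetcher: HeadFetcher = fetchHead
): Promise<string> {
  if (!followRedirects) return pickDomain(originalUrl);
  try {
    const { finalUrl } = await headFetcher(originalUrl);
    return pickDomain(originalUrl, finalUrl);
  } catch {
    // The HEAD request failed: fall back to the original URL's domain.
    return pickDomain(originalUrl);
  }
}
```

With this shape, a consumer that does not want the extra request leaves `followRedirects` off and gets the current behavior. One that enables it would look up configuration under whatever domain the server redirects to.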

Event Timeline

diegodlh renamed this task from "Consider making a HEAD request to target URL and following any redirects" to "Consider providing an option to make a HEAD request to the target URL to confirm target availability and follow redirects". Mar 31 2022, 3:56 PM

I am not sure whether we should implement this in Web2Cit-Core or in Web2Cit-Server. I have a feeling that it would be better in the server. However, because Web2Cit-Monitor may need it too, that would mean implementing it in the monitor as well.

Having this implemented as an optional behavior in the Web2Cit-Core library, and letting consumers (the server, the monitor, etc.) decide whether to use it or not, sounds like a good compromise to me.

Hence, updating the title to "consider providing an option to..."

We may have cases where the target webpage does not exist (i.e., the HTTP request times out or returns a 404 error) but a translation template applies.

This would happen if the template has no Citoid or XPath selection steps in required fields, or if it does but combines them with other selection/transformation steps that make the field output valid. Note: as per T305163, Citoid and XPath selection should return an empty StepOutput if the external resource they rely on is unavailable.

Do we want the translation server to return a citation if the cited source does not actually exist? I think it shouldn't. To prevent this from happening, we may also use the initial HEAD request described in this task to detect whether the target webpage exists. Changing the task's title accordingly.
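The availability check could be sketched as follows, again assuming the standard `fetch` and `AbortController` APIs. The names (`classifyHeadStatus`, `checkTargetAvailability`), the 5-second default timeout, and the three-way result are assumptions made for illustration, not an existing Web2Cit interface. The "unknown" case is there because some servers reject HEAD requests (405 or 501) even though the page exists, so a consumer may prefer not to treat those as missing.

```typescript
// Hypothetical names throughout; not an existing Web2Cit interface.
type Availability = "available" | "unavailable" | "unknown";

// Minimal shape of the HTTP call we need, so a fake can be injected in tests.
type HeadRequest = (
  url: string,
  init: { method: string; redirect: "follow"; signal: AbortSignal }
) => Promise<{ status: number }>;

// Pure helper: map the final HTTP status (after redirects) to a verdict.
function classifyHeadStatus(status: number): Availability {
  if (status >= 200 && status < 300) return "available";
  // 405 Method Not Allowed / 501 Not Implemented: the server may simply
  // not support HEAD, so the target's existence cannot be decided.
  if (status === 405 || status === 501) return "unknown";
  return "unavailable";
}

async function checkTargetAvailability(
  url: string,
  timeoutMs: number = 5000, // assumed default, chosen arbitrarily
  headRequest: HeadRequest = (u, init) => fetch(u, init)
): Promise<Availability> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await headRequest(url, {
      method: "HEAD",
      redirect: "follow",
      signal: controller.signal,
    });
    return classifyHeadStatus(response.status);
  } catch {
    // Timeout (abort) or network failure: treat the target as unavailable.
    return "unavailable";
  } finally {
    clearTimeout(timer);
  }
}
```

A consumer such as the translation server could then refuse to return a citation when the result is "unavailable", and decide separately how strict to be about "unknown".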