
Validate URLs before calling Citoid
Closed, Resolved (Public)

Description

Wikipedia editors can enter any value in the citation template "url" parameter (or its aliases, see T301516), including invalid (i.e., badly formatted) URLs. I assume Citoid would respond with a 400 Bad Request error in these cases, or it may try to resolve the value some other way without using Zotero translators (e.g., as a free-text query). This would load the Citoid service unnecessarily (see T301510), and may even give us incorrect results (e.g., if the query is interpreted as free text).

To prevent this, consider using a library to validate that the value found in the "url" parameter (or one of its aliases) is a valid URL. Consider doing this at an early stage, so that a citation template can be dropped if it does not have a valid URL (just as we currently drop citation templates without URLs).
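As a rough illustration of dropping templates early, here is a minimal sketch that uses only Python's standard library as a stand-in for a dedicated validation library. The function names, the dict-based template representation, and the `url_keys` parameter (for aliases) are hypothetical, and "valid" is assumed here to mean only "has a scheme and a hostname".

```python
from urllib.parse import urlsplit


def has_valid_url(template_params, url_keys=("url",)):
    """Return True if any URL parameter holds a syntactically plausible URL.

    Assumption: a plausible URL has no whitespace, a scheme, and a hostname.
    """
    for key in url_keys:
        value = (template_params.get(key) or "").strip()
        if not value or any(ch.isspace() for ch in value):
            continue
        try:
            parts = urlsplit(value)
        except ValueError:
            # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets.
            continue
        if parts.scheme and parts.hostname:
            return True
    return False


def drop_templates_without_valid_url(templates, url_keys=("url",)):
    """Keep only the templates that carry at least one plausible URL."""
    return [t for t in templates if has_valid_url(t, url_keys)]
```

A check like this is purely syntactic: it does not tell us whether the resource exists or whether Citoid can translate it.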

On the other hand, apparently some of our calls to Citoid are returning 404 errors. The most likely reason is that the resource no longer exists. However, I tried a request for "http://repositorio.unb.br/bitstream/10482/5542/1/2008_NeideMOGodinho.pdf" and got a 404 error, even though the resource does exist. Zotero does not support translating PDFs. See T136722 and T214038. I haven't checked Citoid's source code for this, but I wonder if we are getting a 404 response because of that.

We should avoid making Citoid requests for URLs pointing to PDF files. As far as I know, the only way to know for sure whether a URL points to a PDF file is to check its "Content-Type" response header. We could make a separate HEAD request for all URLs before calling Citoid, but that would add an extra burden. Alternatively, as a partial workaround, we could drop all URLs ending in ".pdf" at URL validation, assuming that they do in fact point to a PDF file.
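The ".pdf" workaround could look roughly like the sketch below. The function name is hypothetical, and it checks the URL path rather than the whole string, which is an assumption on my side: it means a query string after the file name does not hide the extension.

```python
from urllib.parse import urlsplit


def looks_like_pdf(url):
    """Guess from the URL path alone whether the target is a PDF file.

    This is only a heuristic: the authoritative signal is the
    Content-Type response header, which this function never fetches.
    """
    try:
        path = urlsplit(url).path
    except ValueError:
        return False
    return path.lower().endswith(".pdf")
```

For the URL mentioned above, `looks_like_pdf("http://repositorio.unb.br/bitstream/10482/5542/1/2008_NeideMOGodinho.pdf")` returns `True`. PDFs served from paths without the extension would still slip through.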

Event Timeline

Nidiah changed the task status from Open to In Progress. Feb 23 2022, 7:42 PM

URLs are validated before the request using validators.url(url) (see the docs: https://validators.readthedocs.io)

Indeed, URLs ending in .pdf return 404 error. But they represent only 6.35% of our corpus.


Thanks for the info. That's a lot! It would represent roughly 30k references, right? That may take us a lot of time already, as per T301510. I think it would be reasonable to simply ignore them. What do you think?

Sure, at the current rate, that would save us 22h for T301510.

I will also check the validation method; I think it is not working properly. I estimate that there are ~17k invalid URLs, which is 13h of queries.
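For reference, the two estimates above imply a similar per-request cost. The small calculation below only restates the figures quoted in this thread (30k references in 22h, 17k URLs in 13h); the function name is made up for illustration.

```python
def seconds_per_request(hours, requests):
    """Average time per Citoid request implied by a total-duration estimate."""
    return hours * 3600 / requests


# Figures quoted in the comments above.
pdf_rate = seconds_per_request(22, 30_000)      # about 2.64 s per request
invalid_rate = seconds_per_request(13, 17_000)  # about 2.75 s per request
```

Both come out at a bit under 3 seconds per request, so the two estimates are consistent with each other.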

Maybe also check for and ignore some of the cases discussed by email that are valid URLs but would return a 400 error from Citoid:

The other cases discussed by email would either not be valid URLs (the URL doesn't have a hostname) or would require us to actually fetch the URL to find out (the hostname resolves to a private address, or does not resolve, or the maximum number of redirections is reached).

Given that the number of URLs not found is so high (25%, as discussed by email), we may eventually consider making a HEAD request to the actual resource to check whether it returns an HTTP 200 response before making the request to Citoid. This may help with T301510. I'm going to open a separate issue for this, though.
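A HEAD pre-check along these lines might be sketched as follows, using only the standard library. The function names and the injectable `opener` parameter (there so the logic can be exercised without network access) are hypothetical. The sketch also looks at the "Content-Type" header, on the assumption that we would want to skip PDFs in the same pass.

```python
import urllib.error
import urllib.request


def head_check(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (status, content_type) from a HEAD request, or (None, "") on failure."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "") or ""
            return response.status, content_type
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code, ""
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, timeout, refused connection, or an unparsable URL.
        return None, ""


def worth_sending_to_citoid(url, **kwargs):
    """True only for resources that answer 200 and do not declare a PDF body."""
    status, content_type = head_check(url, **kwargs)
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type != "application/pdf"
```

One caveat: some servers reject or mishandle HEAD requests, so a non-200 answer here does not always mean a GET would fail too.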

As discussed with @Nidiah, ignoring URLs that don't start with http or https is pending.

Note that @Kerry_Raymond has found other URLs that are failing with Citoid. She has listed them here; some of them also appear in our list of problematic URLs.

I have tried some of those included in our list:

I've added these comments to our list of problematic URLs. I guess these should be reported to the Citoid project instead.

For our project, I imagine these are exceptions. I wouldn't try to deal with them.

URLs having schemes other than http or https or private hosts were excluded.
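For illustration, a scheme and private-host filter could look like the sketch below. It is not the code actually used: the function name is hypothetical, and "private host" is assumed here to mean `localhost` or a literal IP address that is not globally routable. Hostnames are not resolved, so a DNS name pointing at a private address would still pass.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_http_url(url):
    """Accept only http(s) URLs whose host is not obviously private."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    host = parts.hostname
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a DNS name; it is not resolved here.
        return True
    return address.is_global
```

With this sketch, `ftp://example.org/a`, `http://192.168.1.10/a` and `http://localhost:8080/` are rejected, while `https://example.org/a` is kept.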

Total URLs extracted from articles: 461k
Valid URLs: 432k
Valid URLs without duplicates: 383k
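The deduplication step behind the last figure could be as simple as the sketch below. It assumes duplicates are exact string matches, which may not be how they were actually counted; the function name is hypothetical.

```python
def deduplicate(urls):
    """Remove exact duplicate URLs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(urls))
```

Treating URLs that differ only in case, trailing slash, or fragment as duplicates would need a normalization step first, which this sketch does not attempt.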