
In cite journal, do not add URL expanded from existing identifier like DOI
Closed, Declined; Public

Description

When using Citoid to expand a DOI or other identifier into a full citation template, the link to the DOI or identifier target often gets added to the |url= parameter. The citation then has the same information twice and the URL is redundant. Guidelines and documentation discourage this, for instance https://en.wikipedia.org/wiki/Help:Citation_Style_1#Identifiers :

When an URL is equivalent to the link produced by the corresponding identifier (such as a DOI), don't add it to any URL parameter but use the appropriate identifier parameter, which is more stable and may allow to specify the access status. The |url= parameter or title link can then be used for its prime purpose of providing a convenience link to an open access copy which would not otherwise be obviously accessible.

Other tools and bots, such as Citation bot on the English Wikipedia, will later remove such URLs, but it's better not to add them in the first place. When the user actually inputs a URL, such as the DOI preceded by https://doi.org/, it's debatable whether it should be retained as the URL: you may want to keep it if you consider users sophisticated enough to provide a bare ID when they want the corresponding parameter and the full URL when they (also) want the URL parameter. I personally believe most people use the two forms interchangeably, providing the ID in whatever way is most convenient at the moment, so it doesn't matter. It doesn't matter on the English Wikipedia anyway, because such redundant URLs will eventually be removed.
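
To illustrate the point about bare IDs versus full URLs, here is a minimal sketch in Python (not Citoid's actual code) that treats both input forms as the same identifier. The function name `extract_doi`, the simplified DOI pattern and the list of resolver hosts are all assumptions made for illustration; real DOIs can be messier than this pattern allows.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlparse

# Hypothetical, simplified DOI pattern: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Hosts assumed to be DOI resolvers for the purpose of this sketch.
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def extract_doi(user_input: str) -> Optional[str]:
    """Return the DOI if the input is a bare DOI or a resolver URL, else None."""
    text = user_input.strip()
    # Case 1: the user typed a bare DOI.
    if DOI_PATTERN.match(text):
        return text
    # Case 2: the user typed the DOI preceded by a resolver address.
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.hostname in RESOLVER_HOSTS:
        candidate = unquote(parsed.path.lstrip("/"))
        if DOI_PATTERN.match(candidate):
            return candidate
    # Anything else is treated as an ordinary URL or free text.
    return None
```

With this reading, `extract_doi("10.1000/xyz123")` and `extract_doi("https://doi.org/10.1000/xyz123")` yield the same identifier, so the caller could fill the identifier parameter in both cases and decide separately whether the resolver URL also belongs in |url=.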

This task is arguably a first step before T174540, although T174540 could also be implemented by just filling the URL parameter with whatever the OAdoi.org expansion targets (which would make the URL parameter redundant in 50-70% of cases rather than the current 100%).

Event Timeline

Nemo_bis triaged this task as Medium priority. Sep 12 2019, 7:01 PM

Change 536317 had a related patch set uploaded (by Nemo bis; owner: Federico Leva):
[mediawiki/services/citoid@master] Avoid adding URL which duplicates DOI

https://gerrit.wikimedia.org/r/536317

I've added a patch I wrote some time ago (not tested). At the time I was not sure what method to use as I couldn't see an easy way to track the provenance of pieces of data such as the URL parameter, so I tried to use a method which would work in most cases while having minimal impact on the code.

Dupe of T190850, I think. It was declined because not every template on every wiki has separate fields for pmid, doi and pmc; some need the url parameter.

Doing this on the backend affects every wiki in every language. Since bots remove the duplicate when they occur in wikis where this is unwanted, I don't think doing it in the backend is the correct solution.

> Dupe of T190850, I think. It was declined because not every template on every wiki has separate fields for pmid, doi and pmc; some need the url parameter.

The intention is to remove the URL only when a parameter with a unique identifier is added to the wikitext, so we don't need to worry about templates which do not support such parameters. That said, I cannot guarantee my patch does that yet; I was hoping you might have suggestions on how to achieve it.

> Doing this on the backend affects every wiki in every language. Since bots remove the duplicate when they occur in wikis where this is unwanted, I don't think doing it in the backend is the correct solution.

Adding unrequested publisher URLs, neither asked for nor provided by the user, is a disaster waiting to happen (see also the recent 5000 squatted links). There is no possible justification for it: bad URLs are never needed. DOI.org URLs can sometimes be justifiable.

I'm coming back to the general topic of

> Dupe of T190850, I think. It was declined because not every template on every wiki has separate fields for pmid, doi and pmc; some need the url parameter.

And I feel really stuck. This one seems like really simple logic to deal with:

if identifier_parameter then fill_parameter
else if not URL_in_URL_parameter then fill_URL

Why does it matter then if not every wiki has every kind of identifier?
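
The logic above can be spelled out as a small Python sketch. It is only an illustration under stated assumptions, not how Citoid or VisualEditor actually work: the function `fill_template`, the lowercase field names and the idea of passing in the set of parameters a wiki's template supports are all hypothetical.

```python
from typing import Dict, Optional, Set

# Identifier parameters assumed for this sketch.
IDENTIFIER_FIELDS = ("doi", "pmid", "pmc")


def fill_template(
    citation: Dict[str, str],
    supported: Set[str],
    existing: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Fill identifier parameters when the template has them, else fall back to |url=.

    citation:  data returned for the reference (hypothetical shape)
    supported: parameters the local template accepts (hypothetical input)
    existing:  parameters the editor has already filled in, if any
    """
    params = dict(existing or {})
    filled_identifier = False

    # "if identifier_parameter then fill_parameter"
    for field in IDENTIFIER_FIELDS:
        if field in supported and citation.get(field):
            params[field] = citation[field]
            filled_identifier = True

    # "else if not URL_in_URL_parameter then fill_URL"
    if (
        not filled_identifier
        and "url" in supported
        and not params.get("url")
        and citation.get("url")
    ):
        params["url"] = citation["url"]

    return params
```

Under these assumptions, a template that supports |doi= gets only the DOI, a template with only |url= still gets the URL, and a URL the editor entered by hand is never overwritten.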

There was a recently-closed RFC on enwiki that may be relevant

It's sort of relevant, but this task is not about removing URLs automatically, it's about not adding them automatically. Users should add URLs to the URL parameter if they want to, not just because VisualEditor thinks it's a good idea. I believe the general idea of the RfC is to respect the editors' wishes for the citation templates they add, and this task would go in the same direction.

VisualEditor is currently adding to technical debt/maintenance debt for future generations, when it adds non-permanent URLs to the citation templates. Publishers die all the time, or they get sold and all their URLs get broken and squatted by fraudsters (as with Blackwell cited above). When the user gives us a permalink or unique identifier, it's only appropriate to give them what they asked for, rather than lower quality non-permanent URLs.

The English Wikipedia is big, so at some point someone will come along and code a bot to clean up the broken URLs. But on hundreds of other wikis the broken links will probably remain forever.

Not all citation templates have a DOI field; some only have a URL field. As long as we have this information, we should provide it, because some consumers will prefer to use the URL. If en wiki prefers to remove these with bots, that's the preferable option imo. We can't delete fields in the back-end that are being used on other wikis just because of policy on en wiki.

Also, the spec says that URL is a required field that will always be returned, along with title and access-date, and although that could be changed I don't see a reason for it.

(We could consider changing *which* URL goes in the URL field, but not providing it at all is what has been declined.)
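
One possible reading of that last suggestion, sketched in Python under assumptions: the URL field is still always returned, but a persistent resolver link is preferred over the publisher URL when a DOI is known. The function `choose_url` and the field names `DOI` and `url` are hypothetical here and are not taken from the Citoid spec.

```python
from typing import Dict
from urllib.parse import quote


def choose_url(citation: Dict[str, str]) -> str:
    """Pick the value for the required URL field (hypothetical helper).

    Prefers a doi.org link when a DOI is available, otherwise keeps
    whatever URL the citation data already carries.
    """
    doi = citation.get("DOI")
    if doi:
        # Percent-encode characters that are unsafe in a URL path, keeping "/".
        return "https://doi.org/" + quote(doi, safe="/")
    return citation["url"]
```

This keeps the field required for consumers that only have a URL parameter, while avoiding publisher links that may later break or be squatted.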