
Visual Editor Citation Tool can't handle links to "The Wikipedia Library"
Open, Needs Triage · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

Contributions are made without proper sourcing, or abandoned altogether. Alternatively, the TWL links are added manually (several hundred on dewiki to date), which are of no use to our readers.

Dewiki is currently considering blocking new TWL links via an edit filter, which might be a really bad solution. https://de.wikipedia.org/wiki/Wikipedia:Administratoren/Anfragen#wikipedialibrary_im_ANR_jetzt_blocken (dewiki's sysop board, in German).

What should have happened instead?:

The citation tool should recognize TWL links and convert them to the original URL. In my example this would be
https://www.sciencedirect.com/science/article/abs/pii/S014067362201474X?via%3Dihub .

Other good variants would be a link to the journal ( https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(22)01474-X/ ) or to doi.org ( https://doi.org/10.1016/S0140-6736(22)01474-X ).
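To illustrate what "recognize TWL links" could mean in practice, a minimal check can key off the proxy's hostname suffix. This is only a sketch in Python; the suffix is taken from the proxied URLs discussed in this task, and real tooling would need a configurable list of proxy domains:

```python
from urllib.parse import urlsplit

# Hostname suffix used by The Wikipedia Library's EZproxy instance,
# as seen in the proxied URLs in this task.
TWL_PROXY_SUFFIX = ".wikipedialibrary.idm.oclc.org"

def is_twl_link(url: str) -> bool:
    """True if the URL points at a TWL-proxied resource."""
    host = urlsplit(url).hostname or ""
    return host.endswith(TWL_PROXY_SUFFIX)

print(is_twl_link(
    "https://www-sciencedirect-com.wikipedialibrary.idm.oclc.org"
    "/science/article/abs/pii/S014067362201474X"
))  # → True
print(is_twl_link("https://doi.org/10.1016/S0140-6736(22)01474-X"))  # → False
```

A citation tool could run a check like this before saving a reference and warn the editor, even if it cannot yet rewrite the link itself.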

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

A970C283-7D89-4E4A-8AFD-8592C89DF776.jpeg (740×1 px, 334 KB)
(apologies for German language in the screenshot)

Event Timeline

MBq updated the task description.

Thanks for filing this and bringing that conversation to my attention, @MBq.

We discussed this with @Mvolz a little at T277655. It is, unfortunately, an inherent problem with proxy authentication from libraries, and isn't limited to The Wikipedia Library. Users accessing content via proxy from any library will encounter the same issue. There's not an easy fix, as I understand it.

@jsn.sherman, based on our notes in T277655, do you think there's something we could do to detect Citoid resolving a URL and route it to the non-proxied URL instead?

I have an idea, but it would take some experimentation to sort out:

[...]
@jsn.sherman, based on our notes in T277655, do you think there's something we could do to detect Citoid resolving a URL and route it to the non-proxied URL instead?

As you are hinting at here, the most reliable way of "unwriting" URLs (in EZproxy parlance) is to let EZproxy do it, since it has access to the config used to rewrite them in the first place (until we remove the config, say when a partnership ends). At first glance, it would seem appealing to unwrite those URLs that aren't formed as starting-point/target URLs from the library. But we don't want to change the current behavior for those URLs, because that's how users browse between linked proxied resources while maintaining access.
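For comparison, a purely client-side, best-effort reversal is possible without the proxy's config, though it is exactly the kind of heuristic the config-aware approach avoids. The sketch below assumes the common EZproxy hostname-rewriting convention ("." becomes "-", and a literal "-" is commonly doubled to "--"); it ignores ports and any partner-specific rewrite rules:

```python
from urllib.parse import urlsplit, urlunsplit

TWL_PROXY_SUFFIX = ".wikipedialibrary.idm.oclc.org"

def unwrite(url: str) -> str:
    """Best-effort reversal of EZproxy hostname rewriting.

    Heuristic only: the authoritative mapping lives in the proxy's
    own config, which this sketch does not have access to.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.endswith(TWL_PROXY_SUFFIX):
        return url  # not a proxied URL; leave it alone
    rewritten = host[: -len(TWL_PROXY_SUFFIX)]
    # Undo "." -> "-" and "--" -> "-" via a sentinel so that single
    # and double hyphens don't collide during replacement.
    original = (
        rewritten.replace("--", "\x00").replace("-", ".").replace("\x00", "-")
    )
    return urlunsplit(("https", original, parts.path, parts.query, parts.fragment))

print(unwrite(
    "https://www-sciencedirect-com.wikipedialibrary.idm.oclc.org"
    "/science/article/abs/pii/S014067362201474X?via%3Dihub"
))
```

This would recover https://www.sciencedirect.com/science/article/abs/pii/S014067362201474X?via%3Dihub from the example in the task description, but it can misfire on hostnames where the hyphen convention differs, which is why letting EZproxy do the unwriting is more reliable.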

What some other libraries have done to mitigate issues like this:

  1. Deploy a resolver in front of EZproxy and route all traffic through it. That way we could add a bit of processing to allow Citoid to get an unwritten URL in exactly the conditions we desire. This has big downsides in terms of sustainability and uptime (yet another system to maintain, yet another DSL for its config, a single point of failure, etc.).
  2. Make it somewhat more convenient for the user agent to pass data to the remote site (Citoid in this case) using a bookmarklet or browser extension. This has the downside of requiring user intervention and is error-prone in my experience.

I think we could probably come up with something better, but I need to do a spike on it to experiment. What I would try:

  1. designate a unique junk hostname label to append for citations, like https://www-sciencedirect-com-unwrite-citation.wikipedialibrary.idm.oclc.org/science/article/pii/S014067362201474X?via%3Dihub , and make a config stanza that strips the junk and then unwrites the remaining URL, which could be returned as a redirect.
    • have Citoid append the junk to the hostnames when it sees idm.oclc.org URLs. That new location could replace the original URL during the creation of the citation via an XHR request, if Citoid works that way, or
    • maybe that's another bot/tool that requests the URL with the unique junk while trawling for existing EZproxy citations after the fact, and then updates citations with the new Location header value.
  2. create a public EZproxy page and see if I can get it to do similar processing to the above, to provide something more akin to an API interface.
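The client-side half of step 1 above is mechanically simple. The sketch below inserts the junk label into a proxied hostname, producing exactly the shape of the example URL in that step; the marker string itself is hypothetical, taken from that example, and the actual stripping/unwriting would happen in the EZproxy config stanza, not here:

```python
# Hypothetical marker from the proposal above; the proxy-side config
# stanza would strip it and unwrite the remaining hostname.
JUNK_LABEL = "-unwrite-citation"
TWL_PROXY_SUFFIX = ".wikipedialibrary.idm.oclc.org"

def tag_for_unwrite(url: str) -> str:
    """Append the junk label to a TWL-proxied hostname so the proxy
    can recognize the request as an unwrite request."""
    if TWL_PROXY_SUFFIX not in url:
        return url  # not a proxied URL; nothing to tag
    return url.replace(TWL_PROXY_SUFFIX, JUNK_LABEL + TWL_PROXY_SUFFIX, 1)

print(tag_for_unwrite(
    "https://www-sciencedirect-com.wikipedialibrary.idm.oclc.org"
    "/science/article/pii/S014067362201474X?via%3Dihub"
))
```

Citoid (or the after-the-fact bot from the second sub-bullet) would then request the tagged URL and take the unwritten URL from the redirect's Location header.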

I think we could safely run this publicly since we are kicking user agents out of the proxy. That way the citation tooling doesn't have to know anything about the user session that created the link initially.

If other libraries have tried this, they haven't talked about it. It might result in a cool solution we could share with other libraries, or it might result in nothing.

As I've noticed on enwiki, Citoid does recognize the DOIs on TWL-proxied pages, but does not draw the citation information from them. Is that in scope for this task?

As I've noticed on enwiki, Citoid does recognize the DOIs on TWL-proxied pages, but does not draw the citation information from them. Is that in scope for this task?

This is the same problem, yep. The easiest fix for the URL you noted in that post, though, would be to simply put the DOI (10.1029/2022JD037575) into Citoid directly:

Screenshot 2023-12-20 at 13.18.47.png (608×914 px, 100 KB)
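For context on why the DOI route works even when the URL is proxied: a DOI embedded in a page's metadata is proxy-independent, so it can be pulled out of the proxied HTML and fed to a DOI resolver directly. The sketch below is a generic illustration, not Citoid's actual implementation (Citoid relies on Zotero translators and metadata scraping); the regex is a simplified DOI pattern:

```python
import re

# Simplified DOI pattern; real scrapers also check metadata tags such
# as <meta name="citation_doi"> rather than regex-matching raw HTML.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

def find_dois(html: str) -> list[str]:
    """Return DOI-shaped strings found in a page's HTML."""
    return DOI_RE.findall(html)

page = '<meta name="citation_doi" content="10.1029/2022JD037575">'
print(find_dois(page))  # → ['10.1029/2022JD037575']
```

Each extracted DOI could then be handed to Citoid (or resolved via https://doi.org/) without ever touching the proxied hostname.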