
Citoid should convert proxied domain used by the Wikipedia Library into accessible links
Closed, Duplicate · Public

Description

Taken from this discussion:

For TWL users to access Newspapers.com, the library sends them through a proxied domain at https://www-newspapers-com.wikipedialibrary.idm.oclc.org/. As a result, the proxied domain often makes its way into mainspace citations, which is a problem because those links only work for readers who have TWL access.

One example of an edit that used the proxy link and was then fixed manually:
https://www-newspapers-com.wikipedialibrary.idm.oclc.org/article/the-baltimore-sun/137543642/ --> https://newspapers.com/article/the-baltimore-sun/137543642/
The first link redirects to TWL's home page, while the second links to the article.

Event Timeline

Hey! I should probably pop in and say that I operate a bot working to fix the aftermath of these proxied links being added, but I agree that it would be nice to fix them before they even hit the source code.

I'm also hoping to expand this to work on more than just Newspapers.com links, but it's difficult because of the need to adapt to different domains. Basically, I can't just blindly search for wikipedialibrary.idm.oclc.org and convert the www-newspapers-com bit before it, because some original domains contain a literal - rather than a . at a given position, so the result won't work if I just replace all the dashes with dots. I'd worry that this same snag might affect a quick fix in Citoid.

Most likely, I plan on trying to parse our EZProxy configuration to find all the domain names TWL can possibly proxy and go from there. This might work for finding/replacing in Citoid, too.
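As a rough illustration of that allowlist idea (a sketch under stated assumptions, not the bot's or Citoid's actual code): build a lookup from each known hostname to its proxied form, then rewrite only the URLs whose proxied label is in the lookup. The `KNOWN_HOSTS` list, the function names, and the assumption that the proxy simply turns every `.` into `-` are all illustrative; the real list would come from the EZProxy configuration.

```python
from urllib.parse import urlsplit, urlunsplit

PROXY_SUFFIX = ".wikipedialibrary.idm.oclc.org"

# Hypothetical allowlist for illustration; the second entry is made up
# to show a hostname that contains a real hyphen.
KNOWN_HOSTS = ["www.newspapers.com", "www.example-journals.org"]


def build_lookup(hosts):
    """Map each proxied label (dots turned into dashes) to its original host."""
    return {host.replace(".", "-"): host for host in hosts}


def unproxy(url, lookup):
    """Return the un-proxied URL, or the input unchanged if it is not recognised."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.endswith(PROXY_SUFFIX):
        return url  # not a TWL proxy link
    label = host[: -len(PROXY_SUFFIX)]
    original = lookup.get(label)
    if original is None:
        return url  # unknown domain: leave it alone rather than guess
    # Note: any explicit port or userinfo in the proxied URL is dropped here.
    return urlunsplit((parts.scheme, original, parts.path, parts.query, parts.fragment))


lookup = build_lookup(KNOWN_HOSTS)
fixed = unproxy(
    "https://www-newspapers-com.wikipedialibrary.idm.oclc.org/article/the-baltimore-sun/137543642/",
    lookup,
)
```

Because the lookup is keyed on the proxied label, a hostname with a real hyphen (such as the made-up `www.example-journals.org`) maps back correctly, which a blanket dash-to-dot replacement would not do. If two allowlisted hosts ever produced the same label, the later one would silently win in this sketch, so a real implementation would want to detect that case.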

If there's anything we can do to help with this please let me know :)

@Samwalton9-WMF: Actually, I know there was some discussion in that original task mentioned above about the same problems I mentioned here. Any chance you might know of a reliable method to convert these links? Even a general idea of what direction to take would be great.

Per T277655#6921659 and your note above, unfortunately the proxying of links by EZProxy is destructive. There's no good way to know whether the original link had a . or a - since they both become - in the proxied version.
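To make the ambiguity concrete, here is a small sketch that takes the description above at face value (every `.` and every `-` ends up as `-`). The function names and the `my-site.org` / `my.site.org` hostnames are hypothetical, and this is not EZProxy's actual code. It shows that two different hostnames collapse to the same proxied label, and that a label with n dashes has 2^n possible originals.

```python
from itertools import product


def proxy_label(host):
    """Assumed proxy behaviour: dots become dashes, existing dashes stay dashes."""
    return host.replace(".", "-")


def candidate_hosts(label):
    """Yield every hostname that could have produced the given proxied label."""
    pieces = label.split("-")
    # Each dash in the label might originally have been a dot or a dash.
    for separators in product(".-", repeat=len(pieces) - 1):
        host = pieces[0]
        for sep, piece in zip(separators, pieces[1:]):
            host += sep + piece
        yield host


# Two distinct (made-up) hostnames that collide after proxying.
collision = proxy_label("my-site.org") == proxy_label("my.site.org")

# Every hostname that could sit behind the Newspapers.com label.
candidates = list(candidate_hosts("www-newspapers-com"))
```

Only one of the candidates is the real site, and nothing in the label itself says which, hence the need for an external list of known domains.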

I think your approach of only considering the list of URLs at https://github.com/WikipediaLibrary/twlight_ezproxy/blob/master/expert/ezproxy.cfg (and the related files) is sensible. In case you're not familiar, any line which has something like IncludeFile databases/wileyonlinelibrary.txt corresponds to the predefined URLs behind the links at https://help.oclc.org/Library_Management/EZproxy/EZproxy_database_stanzas/Database_stanzas/EZproxy_database_stanzas_-_All. It looks like we've tended to include the relevant URL above the IncludeFile line, so hopefully you can parse that?
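A minimal sketch of what that parsing could look like, assuming the config uses the standard EZproxy `URL`/`U` and `Host`/`H`/`HJ` directives, one per line. The sample text below is invented for illustration and is not copied from the twlight_ezproxy repository; the function name is likewise hypothetical. `IncludeFile` lines are skipped here, so stanzas pulled in from OCLC's predefined files would need separate handling.

```python
from urllib.parse import urlsplit

# Directives assumed to carry a hostname or URL (full and abbreviated forms).
HOST_DIRECTIVES = {"url", "u", "host", "h", "hostjavascript", "hj"}


def hosts_from_config(text):
    """Collect hostnames named by URL/Host-style directives in config text."""
    hosts = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(None, 1)
        if len(fields) != 2 or fields[0].lower() not in HOST_DIRECTIVES:
            continue  # Title, Domain, IncludeFile, etc. are ignored here
        value = fields[1].strip()
        if "://" not in value:
            value = "//" + value  # lets urlsplit treat a bare host as a netloc
        host = urlsplit(value).hostname
        if host:
            hosts.add(host)
    return hosts


# Invented sample stanza, for illustration only.
SAMPLE = """\
# Newspapers.com
Title Newspapers.com
URL https://www.newspapers.com
Host www.newspapers.com
Domain newspapers.com
IncludeFile databases/wileyonlinelibrary.txt
"""

known_hosts = hosts_from_config(SAMPLE)
```

`Domain` lines are deliberately left out of this sketch because they name a whole domain rather than a specific host; whether to expand them is a design choice for whoever builds the real list.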