Integrate EBSCO Discovery Service into Earwig's Copyvio Detector
Open, Needs Triage, Public

Description

Editors may sometimes add copyrighted text copied from a paywalled source. Such additions may not currently be uncovered by Earwig's Copyvio Detector because it is not able to access such sources. We have access to EBSCO Discovery Service (EDS) in The Wikipedia Library, which editors use to browse content from The Wikipedia Library's publisher partners. EBSCO have given us API credentials to access EDS for the purposes of searching for copyrighted text on Wikipedia, and so we would like to add this functionality to the Copyvios tool.

Users must log in to Copyvios with their Wikipedia login via OAuth. As such, we can check if they are an eligible Wikipedia Library user with an API call (pending T372853).

We can add an extra checkbox on the search page for "Use The Wikipedia Library". Then, when they log in, we do the eligibility check.

If a user is library eligible...
We can display the full text to them in the Copyvios tool just as we do for open web content. When we do so, displaying the source URL is likely not very helpful because EDS URLs are not human-readable. Instead we could display bibliographic information such as the source title, author, publication year, identifier, etc.

EDS1.png (1×1 px, 225 KB)

If a user is not library eligible...
We cannot display any text from the matching source, but we can still display the bibliographic information and match %. We can also highlight which text from the article matched in this source.

Instead of the source text, we should display a note about the user not being eligible for The Wikipedia Library.

Non-TWL user.png (1×1 px, 170 KB)
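
To make the two cases above concrete, here is a minimal sketch of how a single EDS match could be prepared for display. Everything in it is an assumption for illustration: the `EdsMatch` fields, the `render_match` function, and the wording of `TWL_NOTE` are hypothetical and do not reflect the actual Copyvios code or the EDS response format.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical wording; the real note text is still an open question.
TWL_NOTE = (
    "Source text is hidden because you are not eligible for "
    "The Wikipedia Library."
)


@dataclass
class EdsMatch:
    """Hypothetical shape of one EDS search hit."""
    title: str
    author: str
    year: int
    identifier: str
    match_percent: float
    source_text: str


def render_match(match: EdsMatch, library_eligible: bool) -> Dict[str, Any]:
    """Build the display data for one match.

    Bibliographic information and the match percentage are always shown.
    The source text is only included for library-eligible users.
    """
    result: Dict[str, Any] = {
        "citation": f"{match.author} ({match.year}). {match.title}. {match.identifier}",
        "match_percent": match.match_percent,
    }
    if library_eligible:
        result["source_text"] = match.source_text
    else:
        result["source_text"] = None
        result["note"] = TWL_NOTE
    return result
```

Keeping the citation and match percentage outside the eligibility branch means the paywalled text is the only thing that differs between the two views.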

Event Timeline

Restricted Application added subscribers: Sadads, Aklapper.

Some open questions I still have. Thoughts welcome:

Instead of the source text, we should display a note about the user not being eligible for The Wikipedia Library.

I just wrote an example sentence in the mockup, but I wonder if it's worth documenting further what the EDS integration is / how you can become eligible for the library somewhere and linking out to that, or just having a more verbose note here?

In terms of the specifics of how this integration should work - would we add another checkbox here for EDS? What would we label it? "Wikipedia Library search"? I wonder if we should specifically name EBSCO Discovery Service somewhere. We don't in the library before you search, but once you get to the search results there's EBSCO/EDS branding.

Screenshot 2024-10-24 at 12.51.34.png (188×1 px, 34 KB)

What would happen if a user selected EDS, logged in, and wasn't eligible - should we display a notice somewhere that their search isn't being performed with EDS, only the other options? What about if they only select EDS?

I just wrote an example sentence in the mockup, but I wonder if it's worth documenting further what the EDS integration is / how you can become eligible for the library somewhere and linking out to that, or just having a more verbose note here

Linking to the main The Wikipedia Library site should suffice.

In terms of the specifics of how this integration should work - would we add another checkbox here for EDS? What would we label it? "Wikipedia Library search"? I wonder if we should specifically name EBSCO Discovery Service somewhere. We don't in the library before you search, but once you get to the search results there's EBSCO/EDS branding.

Another checkbox, yeah. The label should probably be "Use Wikipedia Library search", using the name most recognizable to editors (and the actual prerequisite for accessing the feature). I guess the only barrier here is if EBSCO wants us to name them specifically.

What would happen if a user selected EDS, logged in, and wasn't eligible - should we display a notice somewhere that their search isn't being performed with EDS, only the other options? What about if they only select EDS?

I'm assuming whatever endpoint or check used in T372853 probably won't be too expensive to call. In this case, we can probably just run a check on every page load (or cache and then revalidate when stale, TTL to be determined) and then change the disabled state of the checkbox based on that. If the user does not have access (or isn't logged in), we'd keep the checkbox disabled. If the user has access, the checkbox will be enabled (but unchecked by default). This essentially restricts all EDS searches to people who have access to TWL. Any attempt to bypass the disabled checkbox (such as with a (hypothetical) ?use_eds=1 query parameter) should show an error, much like how the 429 errors appeared to the user.

Screenshot_1.png (207×990 px, 8 KB)
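
A rough sketch of the cache-then-revalidate idea described above, in case it helps the discussion. The names `EligibilityCache`, `resolve_use_eds`, and `EdsNotAllowed` are hypothetical, the 300-second TTL is a placeholder (the real TTL is still to be determined), and the `check` callable stands in for whatever T372853 ends up providing.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class EdsNotAllowed(Exception):
    """Raised when EDS search is requested by a user without access."""


class EligibilityCache:
    """Caches eligibility results per user and re-checks them once stale."""

    def __init__(
        self,
        check: Callable[[str], bool],
        ttl_seconds: float = 300.0,  # placeholder value, TTL to be determined
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_eligible(self, username: Optional[str]) -> bool:
        # Logged-out users never get the checkbox enabled.
        if username is None:
            return False
        now = self._clock()
        cached = self._entries.get(username)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Missing or stale entry: call the real check and store the result.
        eligible = self._check(username)
        self._entries[username] = (now, eligible)
        return eligible


def resolve_use_eds(requested: bool, eligible: bool) -> bool:
    """Reject attempts to bypass the disabled checkbox, e.g. via ?use_eds=1."""
    if requested and not eligible:
        raise EdsNotAllowed("Wikipedia Library search requires library access.")
    return requested
```

The page handler would use `is_eligible` to set the checkbox's disabled state and call `resolve_use_eds` before running a search, turning `EdsNotAllowed` into a user-facing error.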

It occurs to me, actually, that in the mock above this wouldn't matter - I proposed a system where anyone can use the Wikipedia Library search, we just only show links and excerpts if the user is TWL-eligible.

Ah, good point! In that case, we'll just need the link to the main TWL site (and perhaps include a bit of explanation).

Also, just to confirm, did we make it clear to EBSCO that anonymous users (logged out) can also use Copyvios? And if so, are they okay with us allowing search requests while a user is logged out or should we also put this behind the login wall (like with Google searches)?

Not sure I understand - isn't OAuth login required to do a search? Or is it only required for the Google search?

OAuth is only required for the Google search. Anyone can use the comparison, or the Copyvio bot matches from the Turnitin API, without logging in.

The Wikipedia Library eligibility component of this is now unblocked. You can use the API at https://wikipedialibrary.wmflabs.org/api/v0/users/eligibility/<USERNAME> to retrieve the status of a given user via the boolean wp_bundle_authorized value.
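
For reference, a minimal sketch of calling that endpoint from Python using only the standard library. It assumes the response body is a JSON object with `wp_bundle_authorized` as a top-level key, that no extra authentication is needed, and that usernames can be percent-encoded into the path as-is. None of those details are confirmed here, and the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ELIGIBILITY_BASE = "https://wikipedialibrary.wmflabs.org/api/v0/users/eligibility/"


def eligibility_url(username: str) -> str:
    """Build the endpoint URL, percent-encoding the username for the path."""
    return ELIGIBILITY_BASE + quote(username, safe="")


def parse_eligibility(body: str) -> bool:
    """Return True only if the response explicitly marks the user as authorized.

    Assumes a JSON object with a top-level boolean "wp_bundle_authorized";
    anything else is treated as not eligible.
    """
    data = json.loads(body)
    if not isinstance(data, dict):
        return False
    return data.get("wp_bundle_authorized") is True


def fetch_eligibility(username: str, timeout: float = 10.0) -> bool:
    """Query the eligibility API for one user (performs a network request)."""
    with urlopen(eligibility_url(username), timeout=timeout) as response:
        return parse_eligibility(response.read().decode("utf-8"))
```

Treating a missing or non-boolean value as "not eligible" keeps the failure mode conservative: paywalled text is never shown unless the API says so explicitly.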