
Integrate EBSCO Discovery Service into Earwig's Copyvio Detector
Open, In Progress, Needs Triage, Public

Assigned To
Authored By
Samwalton9-WMF
Oct 24 2024, 11:50 AM
Referenced Files
F79432726: image.png
May 3 2026, 11:12 AM
F79432630: image.png
May 3 2026, 11:12 AM
F57639778: Screenshot_1.png
Oct 24 2024, 3:10 PM
F57639418: Screenshot 2024-10-24 at 12.51.34.png
Oct 24 2024, 11:54 AM
F57639413: Non-TWL user.png
Oct 24 2024, 11:50 AM
F57639410: EDS1.png
Oct 24 2024, 11:50 AM

Description

Editors may sometimes add copyrighted text copied from a paywalled source. Such additions may not currently be uncovered by Earwig's Copyvio Detector because it is not able to access such sources. We have access to EBSCO Discovery Service (EDS) in The Wikipedia Library, which editors use to browse content from The Wikipedia Library's publisher partners. EBSCO have given us API credentials to access EDS for the purposes of searching for copyrighted text on Wikipedia, and so we would like to add this functionality to the Copyvios tool.

Users must log in to Copyvios with their Wikipedia login via OAuth. As such, we can check if they are an eligible Wikipedia Library user with an API call (pending T372853).

We can add an extra checkbox on the search page for "Use The Wikipedia Library". Then, when they log in, we do the eligibility check.

If a user is library eligible...
We can display the full text to them in the Copyvios tool just as we do for open web content. When we do so, displaying the source URL is likely not very helpful because EDS URLs are not human-readable. Instead we could display bibliographic information like the source title, author, publication year, identifier, etc.

EDS1.png (1,258×1,184 px, 225 KB)

If a user is not library eligible...
We cannot display any text from the matching source, but we can still display the bibliographic information and match %. We can also highlight which text from the article matched in this source.

Instead of the source text, we should display a note about the user not being eligible for The Wikipedia Library.

Non-TWL user.png (1,258×1,184 px, 170 KB)
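
The two display cases above could be sketched roughly as follows. This is only an illustration, not the tool's actual data model: the `SourceMatch` fields, the `for_viewer` helper, and the wording of the note are all hypothetical placeholders.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Placeholder wording; the real note text has not been decided.
NOT_ELIGIBLE_NOTE = "Source text is hidden because you are not eligible for The Wikipedia Library."


@dataclass(frozen=True)
class SourceMatch:
    """Hypothetical shape of one EDS match shown in the results."""
    title: str
    author: str
    year: Optional[int]
    identifier: str
    match_percent: float
    source_text: Optional[str]
    note: str = ""


def for_viewer(match, twl_eligible):
    """Return the match as it may be shown to this viewer.

    Eligible users see everything. Everyone else keeps the bibliographic
    information and match %, but the source text is replaced by a note.
    """
    if twl_eligible:
        return match
    return replace(match, source_text=None, note=NOT_ELIGIBLE_NOTE)
```

Highlighting of the matching article text is unaffected here, since that text comes from Wikipedia rather than from the paywalled source.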

Event Timeline

Restricted Application added subscribers: Sadads, Aklapper.

Some open questions I still have. Thoughts welcome:

Instead of the source text, we should display a note about the user not being eligible for The Wikipedia Library.

I just wrote an example sentence in the mockup, but I wonder if it's worth documenting further what the EDS integration is / how you can become eligible for the library somewhere and linking out to that, or just having a more verbose note here?

In terms of the specifics of how this integration should work - would we add another checkbox here for EDS? What would we label it? "Wikipedia Library search"? I wonder if we should specifically name EBSCO Discovery Service somewhere. We don't in the library before you search, but once you get to the search results there's EBSCO/EDS branding.

Screenshot 2024-10-24 at 12.51.34.png (1,358×188 px, 34 KB)

What would happen if a user selected EDS, logged in, and wasn't eligible - should we display a notice somewhere that their search isn't being performed with EDS, only the other options? What about if they only select EDS?

I just wrote an example sentence in the mockup, but I wonder if it's worth documenting further what the EDS integration is / how you can become eligible for the library somewhere and linking out to that, or just having a more verbose note here

Linking to the main The Wikipedia Library site should suffice.

In terms of the specifics of how this integration should work - would we add another checkbox here for EDS? What would we label it? "Wikipedia Library search"? I wonder if we should specifically name EBSCO Discovery Service somewhere. We don't in the library before you search, but once you get to the search results there's EBSCO/EDS branding.

Another checkbox, yeah. The label should probably be "Use Wikipedia Library search", using the name most recognizable to editors (and the actual prerequisite for accessing the feature). I guess the only barrier here is if EBSCO wants us to name them specifically.

What would happen if a user selected EDS, logged in, and wasn't eligible - should we display a notice somewhere that their search isn't being performed with EDS, only the other options? What about if they only select EDS?

I'm assuming whatever endpoint or check used in T372853 probably won't be too expensive to call. In this case, we can probably just run a check on every page load (or cache and then revalidate when stale, TTL to be determined) and then change the disabled state of the checkbox based on that. If the user does not have access (or isn't logged in), we'd keep the checkbox disabled. If the user has access, the checkbox will be enabled (but unchecked by default). This essentially restricts all EDS searches to people who have access to TWL. Any attempt to bypass the disabled checkbox (such as with a (hypothetical) ?use_eds=1 query parameter) should show an error, much like how the 429 errors appeared to the user.

Screenshot_1.png (990×207 px, 8 KB)
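
A rough sketch of the cache-then-revalidate idea described above, assuming the eligibility check is some callable that takes a username and returns a boolean. All names here are hypothetical, and the 300-second TTL is only a placeholder since the real value is still to be determined.

```python
import time


class EligibilityCache:
    """Cache per-user eligibility and re-check once an entry is older than the TTL."""

    def __init__(self, check, ttl_seconds=300.0, clock=time.monotonic):
        self._check = check        # hypothetical callable: username -> bool
        self._ttl = ttl_seconds    # placeholder value; TTL to be determined
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # username -> (checked_at, eligible)

    def is_eligible(self, username):
        now = self._clock()
        entry = self._entries.get(username)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh, skip the remote check
        eligible = bool(self._check(username))
        self._entries[username] = (now, eligible)
        return eligible


def checkbox_disabled(username, cache):
    """The checkbox stays disabled for logged-out and ineligible users."""
    return username is None or not cache.is_eligible(username)


def require_eds_allowed(use_eds, username, cache):
    """Reject requests that ask for EDS while the checkbox would be disabled."""
    if use_eds and checkbox_disabled(username, cache):
        raise PermissionError("Wikipedia Library search is not available for this user")
```

The server-side check in `require_eds_allowed` is what covers the bypass case: the disabled checkbox is only a UI hint, so the request itself still has to be validated.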

It occurs to me, actually, that in the mock above this wouldn't matter - I proposed a system where anyone can use the Wikipedia Library search, we just only show links and excerpts if the user is TWL-eligible.

Ah, good point! In that case, we'll just need the link to the main TWL site (and perhaps include a bit of explanation).

Also, just to confirm, did we make it clear to EBSCO that anonymous users (logged out) can also use Copyvios? And if so, are they okay with us allowing search requests while a user is logged out or should we also put this behind the login wall (like with Google searches)?

Not sure I understand - isn't OAuth login required to do a search? Or is it only required for the Google search?

OAuth is only required for Google search. Anyone can use the comparison or Copyvio bot matches to the Turnitin API.

The Wikipedia Library eligibility component of this is now unblocked - you can use the API at https://wikipedialibrary.wmflabs.org/api/v0/users/eligibility/<USERNAME> to retrieve the status of a given user via the boolean wp_bundle_authorized value.
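
A minimal sketch of calling this endpoint, assuming the response body is a JSON object that contains the wp_bundle_authorized boolean. The function names, the injectable `fetch` argument, the 10-second timeout, and the choice to percent-encode the username are all assumptions for illustration; how the API expects usernames with spaces to be encoded is not confirmed here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint from the comment above; the username is substituted per request.
ELIGIBILITY_URL = "https://wikipedialibrary.wmflabs.org/api/v0/users/eligibility/{username}"


def eligibility_url(username):
    """Build the eligibility URL, percent-encoding the username (assumption)."""
    return ELIGIBILITY_URL.format(username=quote(username, safe=""))


def parse_eligibility(body):
    """Return True only if the body is a JSON object with wp_bundle_authorized true.

    Anything unexpected is treated as "not eligible", so a malformed
    response never unlocks full text.
    """
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return False
    return isinstance(data, dict) and data.get("wp_bundle_authorized") is True


def is_twl_eligible(username, fetch=None):
    """Fetch and parse eligibility; `fetch` can be replaced in tests."""
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
    try:
        return parse_eligibility(fetch(eligibility_url(username)))
    except OSError:
        # Network and HTTP errors from urllib are OSError subclasses; fail closed.
        return False
```

Failing closed keeps the behaviour conservative: if the eligibility service is down, the user is simply treated as not eligible for that request.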

Chlod changed the task status from Open to In Progress. May 3 2026, 9:25 AM
Chlod moved this task from Backlog to Proposed Projects on the Wikimedia-Hackathon-2026 board.

Worked on this for the Hackathon. Surprised that EBSCO still has not revoked the credentials we got from them!

Got the MVP for this now. I still want to clean it up a lot more before submitting a PR. The current status quo is:

  • user not logged in: EDS cannot be used (login will be prompted)
  • user logged in but has no access to TWL: EDS can be used, but the full text is not shown
  • user logged in and has access to TWL: EDS can be used and the full text is shown
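
The three cases above can be written as one small decision function. This is only a sketch of the rule as listed, not the code in the MVP; `EdsAccess` and `eds_access` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdsAccess:
    can_search: bool      # may EDS be used for this check at all?
    show_full_text: bool  # may the matched source text be displayed?


def eds_access(logged_in, twl_eligible):
    """Map login state and TWL eligibility onto the three cases listed above."""
    if not logged_in:
        # Not logged in: EDS cannot be used, the user is prompted to log in.
        return EdsAccess(can_search=False, show_full_text=False)
    # Logged in: EDS can be used; full text is shown only with TWL access.
    return EdsAccess(can_search=True, show_full_text=bool(twl_eligible))
```

For example, `eds_access(True, False)` allows the search but hides the full text, which is the middle bullet.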

EDS hands us a proxy link that automatically leads to the correct resource, which is eventually shown in the results. This link is only ever accessible if the user has a TWL account; they'll be asked to log in through wikipedialibrary.wmflabs.org otherwise. The URL is replaced with the title of the source when results are being shown. On the API, this is a new "title" field.

For UX, the interface calls this "Use Wikipedia Library" instead of something like "Use EDS" or "Use EBSCO Discovery Service".

Screenshots! :D

Still missing some details (author name, DOI, etc.) but that's because I still have to sort out proper formatting on the source titles.

image.png (1,920×1,593 px, 829 KB)
image.png (1,920×1,198 px, 377 KB)

Thanks for working on this @Chlod!! Please let us know when you've been able to submit a PR.

@Chlod, I came here from https://phabricator.wikimedia.org/T399642. What you've done is very impressive! We've been building a source verification tool and we'd love to access TWL sources in the same way the Copyvio Detector accesses them - but rather than checking for copyright violations, we're checking whether sources support claims. Currently the tool only works with online sources and we've been getting a lot of feedback along the lines of "can you make it work with TWL?"

Talk:The_Wikipedia_Library#AI_Subscriptions
Wikipedia_talk:Featured_article_candidates#c-RoySmith-20260511154400-Alaexis-20260511153100
User_talk:Alaexis/AI_Source_Verification#The_Wikipedia_Library_workflow

What's the right forum to discuss this?

The EDS integration is something we've worked out with @Samwalton9-WMF who manages The Wikipedia Library. EBSCO maintains contact with his team and agreed to give us an API key for copyright enforcement purposes. Probably hit Sam up with an email (or Sam can tell you where the best place would be?)