
Citoid ISBN lookup not working
Closed, Declined · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Attempt to get citation data for an ISBN using Citoid (see the reproduction sketch below)
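
A minimal reproduction sketch, assuming only the public REST endpoint captured in the response below (the ISBN, wiki domain, and path are copied from this report; the User-Agent value is a placeholder):

# Reproduction sketch for the failing Citoid ISBN lookup.
# Endpoint and ISBN are copied verbatim from this report.
import requests

url = ("https://en.wikipedia.org/api/rest_v1/data/citation/"
       "mediawiki/978-0-19-861412-8")
resp = requests.get(url, headers={"User-Agent": "citoid-bug-repro/0.1"})
print(resp.status_code)  # 404 while the bug is present, 200 otherwise
print(resp.text)         # HyperSwitch "not_found" problem+json on failure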

Examples:

What happens?:

{
    "type": "https://mediawiki.org/wiki/HyperSwitch/errors/not_found",
    "title": "Not found.",
    "method": "get",
    "uri": "/en.wikipedia.org/v1/data/citation/mediawiki/978-0-19-861412-8"
}
HTTP/2 404 Not Found
content-type: application/problem+json
access-control-allow-origin: *
access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
access-control-expose-headers: etag
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
content-security-policy: default-src 'none'; frame-ancestors 'none'
x-content-security-policy: default-src 'none'; frame-ancestors 'none'
x-webkit-csp: default-src 'none'; frame-ancestors 'none'
vary: Accept-Encoding
date: Thu, 24 Feb 2022 18:49:56 GMT
server: restbase1023
content-location: https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/978-0-19-861412-8
access-control-allow-methods: GET,HEAD
referrer-policy: origin-when-cross-origin
cache-control: private, max-age=0, s-maxage=0, must-revalidate
content-length: 173
age: 0
x-cache: cp1087 miss, cp1077 pass
x-cache-status: pass
server-timing: cache;desc="pass", host;desc="cp1077"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
x-client-ip: ***
X-Firefox-Spdy: h2

What should have happened instead?:
Citation data should be returned, like https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/10.1093%2Fref%3Aodnb%2F14155 (which is the same resource as the third example).

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

AntiCompositeNumber renamed this task from Citoid ISBN lookup working to Citoid ISBN lookup not working. Feb 24 2022, 6:56 PM
AntiCompositeNumber added a project: Citoid.

It's probable we ran into our request limit with WorldCat, but I can't find any obvious spike on the 24th like we've had in the past with bots eating up our requests. It's working for now; we should perhaps improve metrics/tracking to alert us when we go over our request limit.

Down again. Would it be possible to return an error other than 404 for this? I have a few (infrequently run) integration tests that use this API, and it would be useful to be able to detect when this happens.
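
For what it's worth, a sketch of how such a test might tell this outage apart from a genuinely unknown ISBN, assuming a control ISBN that normally resolves (the helper names here are hypothetical):

# Hypothetical integration-test helper: a 404 on a known-good control
# ISBN indicates a service outage rather than a bad test input.
import requests

CITOID = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"
KNOWN_GOOD_ISBN = "978-0-19-861412-8"  # assumption: normally resolves

def citoid_lookup(query: str) -> requests.Response:
    return requests.get(CITOID + query,
                        headers={"User-Agent": "my-integration-tests/0.1"})

def service_is_healthy() -> bool:
    # If even the control ISBN 404s, skip or flag the run instead of
    # reporting the dependent tests as failures.
    return citoid_lookup(KNOWN_GOOD_ISBN).status_code == 200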

@akosiaris - looks like for the last week or so we're getting hammered for isbn requests so much it's made the current error rate like 70% -> https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-30d&to=now&refresh=5m

Any chance you could take a look?

The request rate jumped from ~0.6 rps average to ~4 rps at spikes, so it's not exactly what I would call "getting hammered". There is indeed a single IP that is doing all that traffic, and I did ban it in the interest of un-breaking the service, but:

  • It's an AWS IP; if they are persistent, they can just get a new one from that cloud provider or any other. That would make it a game of whack-a-mole, and it's not in our best interest to play that game.
  • They are clearly within both the API etiquette guidelines (https://www.mediawiki.org/wiki/API:Etiquette) and the REST API limits. They are above the limits we discussed in T294010#7604614 (but never implemented?), though, so we need to update that page in order to have something public about this.

But they are probably NOT even aware of what they are doing, just using an open source library. The user-agent in those requests is "isbnlib" and sure enough a Google search returns: https://pypi.org/project/isbnlib/. Now look at the descriptions of the functions meta(isbn, service='default') and editions(isbn, service='merge').
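
For illustration, this is the usage pattern those two functions expose; per the library's documentation they can fetch metadata from Wikipedia's citation endpoint, which is presumably how each downstream call becomes a request against Citoid (a sketch, not a claim about the library's internals):

# Sketch of the downstream usage behind the "isbnlib" user-agent,
# using the two calls named above.
import isbnlib

isbn = "978-0-19-861412-8"
print(isbnlib.meta(isbn, service="default"))    # bibliographic metadata
print(isbnlib.editions(isbn, service="merge"))  # other editions of the work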

May I suggest you reach out to the author of that library and inform them that we aren't able to serve as an endpoint for their library?

Copying from the other thread -

It's ~70 rps at those peaks. They are most definitely violating https://www.mediawiki.org/wiki/API:Etiquette (even if we don't have hard numbers on that page), and we can take action against that. A quick look at Turnilo shows a single AWS IP with a user agent of Apache-HttpClient/4.5.6 (Java/1.8.0_265) doing the vast majority of these calls in the last day (>85%).

I've gone ahead and added them to our abuser lists (in the private repo). It will take about 30 minutes to propagate fully, but after that they should receive a 403 asking them to contact noc@wikimedia.org.

To be fair, https://en.wikipedia.org/api/rest_v1/ says to limit requests to 200 r/s.

I had forgotten about that. Good point.

Unfortunately, even 1 r/s would eat up our quota. We could add more specific documentation for Citoid on that page.

Yup, we should do that. Having a global recommendation that is incompatible with more specific services like Citoid isn't good decorum. Thanks for bringing it up!

Ideas on sensible limits? Can we handle 1 r/s for URLs, and set a daily cap or something for ISBNs?

We can't currently implement that, unfortunately, but saying something like 100 requests/day is OK. As far as sensible limits go, 1 rps for URLs sounds fine by me. As for ISBNs, it's all about how many users we want to support; e.g. 1000 rpd allows space for 50 users daily maxing out their cap. Which, taking history into account, has a low chance of happening while also allowing a heavy human user to perform quite a few ISBN lookups.
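
For client authors who want to stay under limits like these, a minimal client-side throttle sketch (the figures mirror the proposal above but are placeholders, not published limits):

# Client-side throttle sketch: ~1 request per second plus a daily cap.
# Resetting the daily counter at midnight is left out for brevity.
import time

class CitoidThrottle:
    def __init__(self, min_interval=1.0, daily_cap=100):
        self.min_interval = min_interval  # seconds between requests (~1 rps)
        self.daily_cap = daily_cap        # max ISBN lookups per day
        self.last_request = 0.0
        self.requests_today = 0

    def wait_for_slot(self):
        if self.requests_today >= self.daily_cap:
            raise RuntimeError("daily ISBN lookup cap reached")
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
        self.requests_today += 1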

I've written these limits up in this PR: https://github.com/wikimedia/restbase/pull/1301

It seems to be working smoothly again. Is it the magic of the limit?

Closing as we're no longer using this API.