Automatic citation generation using ISBN on Wikipedia doesn't work
Closed, Resolved | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Add a citation
  • Automatic
  • Insert ISBN number

What happens?:
An error message appears: "We couldn't make a citation for you. You can create one manually using the 'Manual' tab above."

This happens in user sandboxes as well as in NS0 (the main namespace), on both English and Swedish Wikipedia, for several users, whether logged in or not.

What should have happened instead?:
A citation should be automatically generated.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

Event Timeline

ppelberg triaged this task as Unbreak Now! priority. Oct 21 2021, 3:40 PM
ppelberg moved this task from To Triage to Triaged on the VisualEditor board.

I have checked and can confirm locally that the key we use for WorldCat in production is no longer working, whereas our other keys still work. However, according to the WorldCat dashboard, the key has not exceeded its usage limit and is not blocked. I can try rotating the key to see if that works; otherwise I will need to get in touch with support. The key might simply have been dropped from their database, in which case reissuing it might do the trick.

matmarex renamed this task from ISBN bug on Wikipedia to Automatic citation generation using ISBN on Wikipedia doesn't work. Oct 21 2021, 4:53 PM

Mentioned in SAL (#wikimedia-operations) [2021-10-21T17:53:50Z] <mutante> citoid - replaced "wskey" for worldcat in private repo as requested on T294010 (is in 4 places, 3 for deployment_server/k8s and one remnant for scb)

Thank you to @Dzahn in ops for rotating the keys! It all looks fixed now.

Unfortunately, this issue seems to remain unsolved, or the bug has returned, at least on SVWP (Swedish Wikipedia).

If the problem was solved and has come back, I would imagine we got banned from their systems for some reason?

I've now filed a ticket with WorldCat support.

According to WorldCat we're probably exceeding a usage cap of 50,000 requests a day.

Can we ask for an exemption for Wikipedia?

I think first we need to look at the traffic. I'm skeptical it's organic. There's something weird going on here: we're seeing huge daily spikes in ISBN requests at around 20:00.

Whole week - https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-7d&to=now&refresh=5m

To the tune of 80 requests/s

By contrast, September:

https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-1M%2FM&to=now-1M%2FM&refresh=5m

We have never had this many organic ISBN requests; normal traffic is well under 1 request per second. Something is eating our requests and I don't think it's us. @akosiaris
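To put those numbers side by side, here is a minimal sketch of the quota arithmetic. It assumes that every ISBN request counts against the 50,000/day WorldCat cap mentioned above and that the spike runs at a constant 80 requests/s; the function names are illustrative only and are not part of Citoid.

```python
# Back-of-the-envelope quota arithmetic (illustrative, not Citoid code).
DAILY_QUOTA = 50_000             # requests/day, the cap reported by WorldCat
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def sustainable_rate(quota_per_day: int) -> float:
    """Highest constant request rate (req/s) that stays within the daily quota."""
    return quota_per_day / SECONDS_PER_DAY


def seconds_to_exhaust(quota_per_day: int, rate_per_second: float) -> float:
    """Seconds until the quota is used up at a constant request rate."""
    return quota_per_day / rate_per_second


if __name__ == "__main__":
    print(f"sustainable rate: {sustainable_rate(DAILY_QUOTA):.2f} req/s")
    minutes = seconds_to_exhaust(DAILY_QUOTA, 80) / 60
    print(f"time to exhaust at 80 req/s: {minutes:.1f} min")
```

Under these assumptions the quota allows roughly 0.58 requests/s averaged over a day, and a spike at 80 requests/s uses it up in about 10.4 minutes.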

This paints a nice picture: https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=1635185829182&to=1635189452614 Something eats up our ISBN requests for about half an hour; the requests keep coming for another half hour after that, all of them getting 404ed, and then they stop.

Here are the corresponding logs - https://logstash.wikimedia.org/app/dashboards#/view/5eaf4e40-f6b6-11eb-85b7-9d1831ce7631?_g=h@afe00e9&_a=h@7a3988c

I'm not sure where to go from there, can ops help? :(

Mvolz removed Mvolz as the assignee of this task. Oct 29 2021, 9:25 AM

Agreed.

Unfortunately, copy-pasted Kibana URLs don't work. The interface shows an error saying "Error restoring state from URL. Unable to completely restore the URL, be sure to use the share functionality." Mind sharing the link that way instead? Thanks!

It's ~70 rps at those peaks. They are most definitely violating https://www.mediawiki.org/wiki/API:Etiquette (even if we don't have hard numbers on that page) and we can take action against that. A quick look at Turnilo shows a single AWS IP with a user agent of Apache-HttpClient/4.5.6 (Java/1.8.0_265) making the vast majority (>85%) of these calls in the last day.

I've gone ahead and added them to our abuser lists (in the private repo). It will take some 30 minutes to propagate fully, but after that they should receive a 403 asking them to contact noc@wikimedia.org.
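The kind of breakdown described above (one client responsible for most of the calls) can be sketched as a simple count per (IP, user agent) pair. This is only an illustration of the idea, not how Turnilo works: the `Request` record, the `top_client` helper, and the sample IPs (taken from the reserved documentation ranges) are all hypothetical. Only the user agent string comes from this task.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Request(NamedTuple):
    """One logged request; a hypothetical, simplified log record."""
    ip: str
    user_agent: str


def top_client(requests: Iterable[Request]) -> Tuple[Request, float]:
    """Return the most frequent (ip, user_agent) pair and its share of all requests."""
    counts = Counter(requests)
    if not counts:
        raise ValueError("no requests to analyse")
    client, hits = counts.most_common(1)[0]
    return client, hits / sum(counts.values())


# Placeholder sample: nine requests from one client, one from another.
SAMPLE = (
    [Request("192.0.2.10", "Apache-HttpClient/4.5.6 (Java/1.8.0_265)")] * 9
    + [Request("198.51.100.7", "ExampleBrowser/1.0")]
)

if __name__ == "__main__":
    client, share = top_client(SAMPLE)
    print(f"{client.ip} [{client.user_agent}] made {share:.0%} of requests")
```

A client whose share is this far above everyone else's is a reasonable candidate for a closer look before any block is applied.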

akosiaris lowered the priority of this task from Unbreak Now! to Low. Oct 29 2021, 3:01 PM

With that mitigation in place, I am switching the priority to Low for monitoring (and to see whether the user reaches out to us). Hopefully WorldCat will soon refresh the quota for our key. If not, we might need another new one.

Looking at https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=43&orgId=1&from=now-30d&to=now, I think the ban worked. I haven't seen anyone reach out to noc@wikimedia.org yet, but we no longer have that excessive amount of ISBN citation usage. I don't know whether we are under quota again and whether everything is working as expected, though. I did a quick test and it seems to work. @Mvolz, care to have a look?

Everything looks good to me now.

akosiaris claimed this task.

OK, I am gonna resolve this and set an event (E1433) to unblock that IP in a month. It's an AWS IP, so there is little point in keeping it blocked indefinitely: these IPs are easily recycled, and anyone can end up with it after a while.

To be fair, https://en.wikipedia.org/api/rest_v1/ says to limit requests to 200 r/s. Unfortunately even 1 r/s would eat up our quota. We could add more specific documentation for Citoid on that page.

I had forgotten about that. Good point.

Yup, we should do that. Having a global recommendation that is incompatible with more specific services like Citoid isn't good decorum. Thanks for bringing it up!

Ideas on sensible limits? Can we handle 1 r/s for URLs, and set a daily cap or something for ISBNs?

I've unblocked the IP, as it did not make sense to keep blocking a reusable AWS IP indefinitely. If they show up again causing issues, it's easy to block them again.

We can't currently implement a daily cap, unfortunately, but saying something like 100 requests/day is OK. As far as sensible limits go, 1 rps for URLs sounds fine by me. As far as ISBNs go, it's all about how many users we want to support. For example, 1000 rpd leaves room for 50 users maxing out their cap every day. Taking history into account, that is unlikely to happen, and it still allows a heavy human user to perform quite a few ISBN lookups.
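The sizing argument above can be written out as a short sketch. It assumes the 50,000 requests/day WorldCat quota mentioned earlier in this task and a documented per-client cap of 1000 requests/day. The `DailyCap` class is purely hypothetical: it shows what an in-memory per-client daily counter might look like, and is not something Citoid implements today.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

WORLDCAT_DAILY_QUOTA = 50_000   # requests/day, as reported by WorldCat
PER_CLIENT_DAILY_CAP = 1_000    # requests/day, the example cap discussed above


def clients_at_full_cap(quota: int, cap: int) -> int:
    """How many clients can max out their cap in a day without exceeding the quota."""
    return quota // cap


class DailyCap:
    """Hypothetical in-memory per-client daily request counter."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self._day: Optional[date] = None
        self._counts = defaultdict(int)

    def allow(self, client_id: str, today: date) -> bool:
        """Count one request for client_id; return False once the cap is reached."""
        if today != self._day:       # first request of a new day: reset all counters
            self._day = today
            self._counts.clear()
        if self._counts[client_id] >= self.cap:
            return False
        self._counts[client_id] += 1
        return True


if __name__ == "__main__":
    print(clients_at_full_cap(WORLDCAT_DAILY_QUOTA, PER_CLIENT_DAILY_CAP))
```

With these numbers the quota covers 50 clients at full cap. A real deployment would also need shared state across service instances and a stable way to identify clients, both of which are outside the scope of this sketch.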