
Several major news websites (NYT, Reuters...) block citoid
Closed, Resolved · Public

Description

A number of websites have blocked Citoid due to the high volume of traffic it places on their servers. This results in Citoid errors when editors attempt to generate citations from these websites' URLs.

Known blocked websites

Strategy and state

Strategy

At present, there are four strategies we are pursuing to ensure volunteers are able to reliably generate citations using Citoid in ways that meet our partners' needs and expectations.

These four strategies are as follows…
1. Align with partners so that we can:

  • Understand their needs and expectations to further improve how Citoid behaves.

2. Improve UX so that we can:

  • Offer volunteers clear path(s) forward when Citoid fails
  • Simplify the steps to generate a reference when Citoid is unable to do so automatically

3. Increase observability so that we can:

  • Swiftly address issues when they emerge
  • Ensure Citoid is behaving in ways that meet volunteer and partner needs
  • Evaluate the impact of changes we're making to Citoid
NOTE: our ability to observe Citoid is constrained by its interaction with Zotero, a system we don't have full insight into.

4. Reconsider internal assumptions so that we can:

  • Ensure Citoid behaves in ways that accommodate the technical and business constraints that keep partner infrastructure sustainable

State

This section contains the actions we are taking, and will consider taking in the future, to deliver the impact described in the Strategy section above.

Improve Citoid UX
  • T364595: Offer people an alternative path for generating citations from within Citoid's error state. ✅ Done; deployed 12 June 2024
  • T364594: Revise Citoid's error message to be more specific. ✅ Done; deployed 13 June 2024

Increase observability
  • T364901: Log data about which domains are failing most frequently. ✅ Done; data being logged as of ~24 June 2024
  • T365583: Log data when Citoid fails because the media type (e.g. PDFs) is not supported. ✅ Done; deployed 12 June 2024
  • T364903: [SPIKE] Determine how specific we can be about logging why Citoid is failing. ✅ Investigation complete; results informing work in T365583 and T364901
  • T368802: Identify patterns in the data now being logged about Citoid performance. Up next

Reconsider internal assumptions
  • T366093: Change Citoid's user agent to use the same pattern as Zotero's. ✅ Done; deployed 12 June 2024
  • T367194: Citoid/Zotero: make rate limiting configurable on a per-site basis. Exploring technical feasibility; work not yet prioritized
  • T367452: Reduce Citoid's HTTP request volume by using HTTP HEAD instead of HTTP GET. ✅ Done; deployed week of 17 June 2024
  • Ticket needed: Cache metadata results to reduce the amount of traffic we're sending to domains. Investigation required to assess feasibility; not yet prioritized
  • Ticket needed: Enable people to do the metadata scraping themselves. Investigation required to assess feasibility; not yet prioritized
  • Ticket needed: Write Citoid as a layered set of data adapters. Investigation required to assess feasibility; not yet prioritized
  • T95388: Fall back to archive.org when a Citoid request fails. 🟢 Investigation is active

Align with partners
  • Talk with partners directly to understand what they need from Citoid to fulfill the requests people are making with it. In progress
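The per-site rate limiting proposed in T367194 could take the shape of a token bucket keyed by domain. The sketch below is purely illustrative (Citoid itself is a Node.js service; every name here is hypothetical), showing the general mechanism: each domain gets its own budget of requests that refills over time.

```python
import time
from collections import defaultdict


class DomainRateLimiter:
    """Token bucket per domain: at most `rate` requests per `per` seconds.

    Illustrative sketch only, not Citoid's actual implementation.
    """

    def __init__(self, rate=5, per=60.0, clock=time.monotonic):
        self.rate = rate
        self.per = per
        self.clock = clock
        # domain -> (remaining tokens, timestamp of last refill)
        self.buckets = defaultdict(lambda: (float(rate), clock()))

    def allow(self, domain):
        tokens, last = self.buckets[domain]
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the bucket size.
        tokens = min(self.rate, tokens + (now - last) * self.rate / self.per)
        if tokens >= 1.0:
            self.buckets[domain] = (tokens - 1.0, now)
            return True
        self.buckets[domain] = (tokens, now)
        return False
```

A per-site configuration would then just be a mapping from domain to `(rate, per)` pairs, letting stricter partners get lower budgets without throttling everyone.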

Original description
first seen today at an event: https://en.wikipedia.org/wiki/Special:Diff/1218432547

later during same event had a problem with NY times. https://en.wikipedia.org/wiki/Special:Diff/1218452300

I went home, pulled a link off NY times front page and tried a test at [[Wikipedia:Sandbox]]. (didn't save)
link: https://www.nytimes.com/2024/04/11/us/politics/spirit-aerosystems-boeing-737-max.html
error message: We couldn't make a citation for you. You can create one manually using the "Manual" tab above.

NY Times was definitely working here (2024-02-13); this URL is also now broken: https://en.wikipedia.org/wiki/Special:Diff/1207056572

Related Objects

Status     Subtype      Assigned
Resolved   -            Samwalton9-WMF
Resolved   BUG REPORT   Mvolz
Resolved   -            Mvolz
Resolved   -            EAkinloose
Resolved   -            zoe
Resolved   -            VPuffetMichel
Resolved   -            Esanders
Open       -            None
Resolved   -            Mvolz
Resolved   BUG REPORT   Mvolz
Resolved   -            Mvolz
Open       -            ppelberg
Open       -            None
Resolved   BUG REPORT   Mvolz
Open       -            None
Resolved   BUG REPORT   None
Resolved   -            Ryasmeen
Open       -            None
Resolved   -            None
Open       -            Esanders
Resolved   -            colewhite
Open       -            ppelberg
Declined   -            None
Resolved   -            ppelberg
Resolved   -            MNeisler
Resolved   -            Ryasmeen
Open       -            None
Resolved   -            dchan
Resolved   -            zoe
Open       -            None
Open       -            ppelberg
Open       -            nayoub
Declined   -            dchan
Open       -            None
Resolved   -            None
Declined   -            None
Resolved   -            ppelberg
Resolved   BUG REPORT   SonjaPerry
Open       BUG REPORT   None

Event Timeline


Change #1034555 merged by jenkins-bot:

[mediawiki/services/citoid@master] Switch back to using HEAD instead of GET for redirect tracking

https://gerrit.wikimedia.org/r/1034555
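The change above (T367452) switches redirect tracking from GET to HEAD, so the server returns only status and headers rather than full page bodies. A hedged illustration of the idea using Python's standard library (Citoid's actual code is Node.js, and the User-Agent string below is a placeholder, not the real one adopted in T366093):

```python
import urllib.request

# Placeholder UA: T366093 changed Citoid's user agent to follow Zotero's
# pattern, but this exact string is invented for illustration.
USER_AGENT = "ExampleCitationBot/1.0 (+https://example.org/contact)"

def build_probe(url):
    """Build a HEAD request for redirect tracking.

    A HEAD response carries the same status code and Location header as a
    GET, which is all that's needed to follow redirects, without
    transferring the response body.
    """
    return urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": USER_AGENT},
    )
```

The double-download concern raised later in this thread is exactly what this avoids: probing a URL no longer costs the site a full page render.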

@Robertsky We heard back from the property owner who showed us graphs of traffic and traffic levels and said that this was clearly more than what they considered acceptable. Their concern was purely about volume and not about suspicious looking user agent strings.

ppelberg updated the task description.

Update: 14 June

We've updated the task description to include the Strategy and State section that's meant to help us all understand:

  1. The strategy that is guiding the action we're taking to address this issue
  2. The state of the "actions" mentioned in "1." (above)

If anything you see brings questions/ideas to mind, we'd value knowing.

> @Robertsky We heard back from the property owner who showed us graphs of traffic and traffic levels and said that this was clearly more than what they considered acceptable. Their concern was purely about volume and not about suspicious looking user agent strings.

And what kind of volume were they seeing?

And just to be sure: there is no excessive traffic generated by some random (non-Wikimedia) spider/harvester operator who might be using the Citoid user agent?

Why are the details about the problems sites report so vague? 😅
Did T367452 resolve any of these sites' problems? Every visit being a double-download might have been enough of a spike for some of them to block.

As editors visit the pages themselves while reading, a longer term solution might be something that captures cites client-side while browsing and lets you post those to a wikibase as Mvolz mentioned, then checks that base first in citoid before pinging out to the web.

For instance, when I'm writing something that involves dozens of PDFs, the last thing I want is to generate all those cites by hand. But I could quickly capture them all while reading and then insert them while editing.
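The capture-first idea above, and the "Cache metadata results" row in the State table, share a shape: consult a local store before sending any traffic to the source domain. A minimal sketch of that flow, with every name hypothetical:

```python
import time


class MetadataCache:
    """TTL cache of citation metadata keyed by URL, consulted before any
    outbound request. Illustrative only; not Citoid's design.
    """

    def __init__(self, ttl=86400.0, clock=time.monotonic):
        self.ttl = ttl          # seconds before an entry goes stale
        self.clock = clock
        self.store = {}         # url -> (metadata, stored_at)

    def get_or_fetch(self, url, fetch):
        entry = self.store.get(url)
        if entry is not None:
            metadata, stored_at = entry
            if self.clock() - stored_at < self.ttl:
                return metadata          # cache hit: zero traffic to the domain
        metadata = fetch(url)            # cache miss: one outbound request
        self.store[url] = (metadata, self.clock())
        return metadata
```

The same `get_or_fetch` shape works whether the store is filled by a previous Citoid request or by metadata a reader captured client-side and posted to a shared wikibase, as suggested above.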

In T362379#9913315, @Sj wrote:

> Why are the details about the problems sites report so vague? 😅

@Sj can you say a bit more what you mean by "vague" in this context? What might you expect us to be able to know about the problems that you've not seen documented?

> Did T367452 resolve any of these sites' problems? Every visit being a double-download might have been enough of a spike for some of them to block.

Great question! We're investigating this as part of T368802 and will share what we learn...

> For instance, when I'm writing something that involves dozens of PDFs, the last thing I want is to generate all those cites by hand. But I could quickly capture them all while reading and then insert them while editing.

Oh! What a nifty idea...

Could you please read the newly-created T368980 and boldly edit the description to align with what you had in mind...?

Thanks Peter, I commented on T368980; the description is fine.

Vagueness:

  • "told by at least one organisation that the block is deliberate." (which orgs?)
  • "owners of one property said that the traffic pattern looks like abuse... showed us graphs of traffic... [t]heir concern was purely about volume." (what patterns? how much volume?) - @AlexisJazz asked about this above

One could imagine smoke tests of our outgoing traffic against a glossary of patterns we've tried to avoid in the past.
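Such a smoke test could be as simple as checking the headers of outgoing requests against a deny-list of patterns that have triggered blocks before. A hypothetical sketch (the patterns and names below are invented for illustration; no such Wikimedia list is implied):

```python
import re

# Hypothetical deny-list of User-Agent patterns that site operators have
# historically read as "bot-like"; invented for this example.
SUSPICIOUS_UA_PATTERNS = [
    re.compile(r"^python-requests/"),   # default library UA, no contact info
    re.compile(r"^$"),                  # empty User-Agent
]

def smoke_test_headers(headers):
    """Return a list of warnings for an outgoing request's headers."""
    warnings = []
    ua = headers.get("User-Agent", "")
    for pattern in SUSPICIOUS_UA_PATTERNS:
        if pattern.search(ua):
            warnings.append(f"User-Agent matches deny-listed pattern: {pattern.pattern}")
    if "+http" not in ua and "mailto:" not in ua:
        warnings.append("User-Agent has no contact URL or email")
    return warnings
```

Run as a CI check or a periodic sample of live traffic, an empty warning list would give some confidence that outgoing requests don't match patterns partners have objected to in the past.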

ppelberg updated the task description.

Another user report: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#Adding_Hindustan_Times_sources

Looks like Hindustan Times is potentially IP-blocking us, as it works locally but not in production.

FYI, it seems that straitstimes.com may be blocked now as well.

Yup. Very clear block starting Nov 5th 2024 visible: https://logstash.wikimedia.org/app/dashboards#/view/5eaf4e40-f6b6-11eb-85b7-9d1831ce7631?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-3M%2Cto%3Anow))

We're not sending them more than 30 requests/day so :/ The day they blocked us, only 3 requests!
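Figures like "30 requests/day" come from tallying Citoid's outbound request logs per target domain. The aggregation itself is simple; the sketch below uses an invented log format (one requested URL at the end of each line), not Citoid's real log schema:

```python
from collections import Counter
from urllib.parse import urlsplit

def tally_by_domain(log_lines):
    """Count outbound requests per target domain.

    Assumes one requested URL per log line, e.g.
    '2024-11-05T12:00:00Z GET https://www.straitstimes.com/article'
    (an invented format for illustration).
    """
    counts = Counter()
    for line in log_lines:
        url = line.rsplit(" ", 1)[-1]       # last field is the URL
        domain = urlsplit(url).hostname
        if domain:
            counts[domain] += 1
    return counts
```

Bucketing the same tally by day is what makes a block visible: the request count to a domain drops to near zero while error rates spike.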

Tested locally and it doesn't work either - might have been an infrastructure change rather than a deliberate block.

Was the user agent string changed around then as well? If so, maybe we can drop an email to them to see if they can unblock the user agent string?

Sorry, that was ambiguous - I meant an infrastructure change on their end, maybe, because it doesn't work on a different IP and with a different user-agent either.

Interesting. This works with https://citer.toolforge.org though. FYI, their sister site, zaobao.sg, also cannot be parsed through citoid.

papers.ssrn.com is another one: (user report here). This appears to be an actual block, AFAIK. https://www.mediawiki.org/wiki/Topic:Ybg5cq7nal5z0dom

This might not be the case here, but almost a year ago, I started to see many sites becoming inaccessible to citer. After some investigation, I found that the issue was mostly caused by TLS fingerprinting processes used by Cloudflare servers, which refuse to respond to non-browser-like TLS handshakes. As a workaround, citer now uses curl_cffi, which is a Python binding for a fork of curl-impersonate which is able to perform TLS and HTTP handshakes in a manner identical to that of real browsers. This change rectified citer's issue for many such sites.

That's good to know! We're definitely also seeing that for a lot of websites. We're hoping registering as a friendly bot with cloudflare T370118 will ameliorate things... but this hasn't happened yet.

Mvolz renamed this task from Several major news websites (NYT, NPR, Reuters...) block citoid to Several major news websites (NYT, Reuters...) block citoid. Mar 6 2025, 1:13 PM
Samwalton9-WMF renamed this task from Several major news websites (NYT, Reuters...) block citoid to Several major news websites (NYT, Reuters...) block citoid. Oct 16 2025, 10:45 AM
Samwalton9-WMF updated the task description.
Samwalton9-WMF claimed this task.

Since the major issues identified here have been resolved and this ticket has become a big tracking ticket for disparate work, I'm closing it as resolved.

If you face future issues with specific websites, please open individual tickets per website.