Several major news websites (NYT, NPR, Reuters...) block citoid
Open, Needs Triage · Public

Description

A number of websites have blocked Citoid due to the high volume of traffic its activity is placing on their websites. This results in Citoid errors when attempting to cite content by inserting these websites' URLs.

Known blocked websites

  • New York Times (T323169)
  • NPR (T362873)
  • Reuters
  • Elsevier ScienceDirect

Original description
first seen today at an event: https://en.wikipedia.org/wiki/Special:Diff/1218432547

later during same event had a problem with NY times. https://en.wikipedia.org/wiki/Special:Diff/1218452300

I went home, pulled a link off NY times front page and tried a test at [[Wikipedia:Sandbox]]. (didn't save)
link: https://www.nytimes.com/2024/04/11/us/politics/spirit-aerosystems-boeing-737-max.html
error message: We couldn't make a citation for you. You can create one manually using the "Manual" tab above.

NY Times was definitely working here (2024-02-13); this URL is also now broken: https://en.wikipedia.org/wiki/Special:Diff/1207056572

Event Timeline

It has also stopped working with Reuters.

Esanders renamed this task from citoid errors inserting new ref in VE to citoid failing to create refs for major sites (NYT, NPR). (Thu, Apr 18, 5:47 PM)
Esanders renamed this task from citoid failing to create refs for major sites (NYT, NPR) to citoid failing to create refs for major sites (NYT, NPR, Reuters...).
Mvolz renamed this task from citoid failing to create refs for major sites (NYT, NPR, Reuters...) to Several major sites (NYT, NPR, Reuters...) block citoid. (Fri, Apr 19, 1:35 PM)

The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!

There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?

This is partly a consequence of the fact that our traffic has increased a lot over the last few years; we didn't use to trigger IP blocks as often.

A possible solution would be to close off the API, but that would mean we'd no longer support things like reftoolbar.

We may also want to look into adding blacklists for websites that have expressed that they definitely do not want us accessing them, to be respectful of that.

Mvolz renamed this task from Several major sites (NYT, NPR, Reuters...) block citoid to Several major news websites (NYT, NPR, Reuters...) block citoid due to too much traffic. (Fri, Apr 19, 1:40 PM)
Mvolz removed Mvolz as the assignee of this task.
Mvolz subscribed.

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?

Possibly! It's easiest in cases where The Wikipedia Library has an ongoing dialogue, as is the case with Elsevier, who we're currently talking with about this issue.

> This is partly a consequence of the fact that our traffic has increased a lot over the last few years; we didn't use to trigger IP blocks as often.
>
> A possible solution would be to close off the API, but that would mean we'd no longer support things like reftoolbar.
>
> We may also want to look into adding blacklists for websites that have expressed that they definitely do not want us accessing them, to be respectful of that.

Could we set up a system whereby API keys are manually distributed for tools which are going to be used on Wikimedia projects? I'd hope we could find a middle ground between 'fully open' and 'fully closed'. Unless of course tools like reftoolbar are the primary culprit of this increased traffic.

We could crowdsource the IP addresses, which would help with paywalls too: write a JS bookmarklet for people to click while visiting the ref. It could then phone home with the results or spit out something like JSON to paste into the wiki editor somewhere.

But this is odd. I would have thought they'd be happy to share metadata. Are they getting so much traffic from us that it's affecting performance?

Besides statistics about which tools generate this much volume, could you also share some kind of time-series dashboard with request volume and error volume? Thank you.

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?

...there's no budget to get a VPN for $2/month and spread the traffic across a dozen IPs? Do you need me to pay for it? I can do that.

If that's a privacy problem, I can get a VPS for €1/month with unlimited traffic.

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> ...there's no budget to get a VPN for $2/month and spread the traffic across a dozen IPs? Do you need me to pay for it? I can do that.

I don't think hiding our traffic so that our impact is less obvious is the way to go here, not least because as far as I'm aware Citoid identifies itself in the user agent. Much better for us in the long run to work with these orgs to either educate them on what we're doing, or to figure out ways of reducing our traffic volume. We don't want to burn bridges, especially when many of these orgs are current or potential TWL partners.

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> ...there's no budget to get a VPN for $2/month and spread the traffic across a dozen IPs? Do you need me to pay for it? I can do that.
>
> I don't think hiding our traffic so that our impact is less obvious is the way to go here, not least because as far as I'm aware Citoid identifies itself in the user agent. Much better for us in the long run to work with these orgs to either educate them on what we're doing, or to figure out ways of reducing our traffic volume. We don't want to burn bridges, especially when many of these orgs are current or potential TWL partners.

I doubt these blocks are deliberate. Getting their articles linked seems to be in their interest: it'll draw some visitors in and likely improves their search engine ranking. They'll just block any IP that causes too much traffic.

They may quite well suggest to you that instead of them going out of their way to whitelist you, you just use more IPs. If you're worried about how using multiple IPs looks, here's the magic word: load balancing! You can keep the current user agent. If the block is deliberate, you can go the diplomacy route, and if that fails, going rogue is always a last resort.

What about archive.org? Metadata doesn't change much generally; if they have a recent-ish copy you can avoid the traffic to the original site. Also, are you already caching the metadata?
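
For illustration only, a minimal sketch of the archive.org idea using the public Wayback Machine availability API; the function name is made up and Node 18+ (global fetch) is assumed:

// Minimal sketch (not existing Citoid code): ask the Wayback Machine
// availability API whether a snapshot exists, so metadata scraping could
// target the archived copy instead of the origin site.
async function findSnapshot(url) {
	const api = 'https://archive.org/wayback/available?url=' + encodeURIComponent(url);
	const res = await fetch(api);
	const data = await res.json();
	const closest = data.archived_snapshots && data.archived_snapshots.closest;
	// closest.url points at the archived copy; closest.timestamp is YYYYMMDDhhmmss
	return closest && closest.available ? closest.url : null;
}

findSnapshot('https://www.nytimes.com/2024/04/11/us/politics/spirit-aerosystems-boeing-737-max.html')
	.then(snapshot => console.log(snapshot || 'no snapshot found'));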

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> ...there's no budget to get a VPN for $2/month and spread the traffic across a dozen IPs? Do you need me to pay for it? I can do that.
>
> I don't think hiding our traffic so that our impact is less obvious is the way to go here, not least because as far as I'm aware Citoid identifies itself in the user agent. Much better for us in the long run to work with these orgs to either educate them on what we're doing, or to figure out ways of reducing our traffic volume. We don't want to burn bridges, especially when many of these orgs are current or potential TWL partners.
>
> I doubt these blocks are deliberate. Getting their articles linked seems to be in their interest: it'll draw some visitors in and likely improves their search engine ranking. They'll just block any IP that causes too much traffic.
>
> They may quite well suggest to you that instead of them going out of their way to whitelist you, you just use more IPs. If you're worried about how using multiple IPs looks, here's the magic word: load balancing! You can keep the current user agent. If the block is deliberate, you can go the diplomacy route, and if that fails, going rogue is always a last resort.

We've been explicitly told by at least one organisation that the block is, unfortunately, deliberate, due to concerns about the volume of traffic. I agree that there are convincing reasons for them not to block us and that it is ultimately in their best interests - we'll just have to see how these conversations go.

While npr.org isn't covered by search engine cache, it seems https://www.nprillinois.org/ is. There's no obvious way to rewrite the URLs though, for example compare:

Obviously you'd have to be careful not to get banned by Bing and Yandex. And obviously this solution works only for NPR and only as long as that domain allows search engine cache. So archive.org would be easier/more versatile, but search engine cache is probably more complete.

How about scraping RSS feeds every 10 minutes and caching the results? That's very little traffic. Includes titles, links, author and date.
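
To make the RSS idea concrete, a rough sketch (hypothetical, not existing Citoid code; the NPR feed URL is assumed, and a real implementation would use a proper XML parser):

// Poll a feed periodically and keep a URL -> metadata map that a citation
// request could consult before ever touching the news site itself.
const FEED = 'https://feeds.npr.org/1001/rss.xml'; // assumed NPR "News" feed URL
const feedCache = new Map();

async function refreshFeed() {
	const xml = await (await fetch(FEED)).text();
	// naive tag extraction, good enough for a sketch
	const pick = (item, tag) => (item.match(new RegExp('<' + tag + '>([\\s\\S]*?)</' + tag + '>')) || [])[1];
	for (const item of xml.match(/<item>[\s\S]*?<\/item>/g) || []) {
		const link = pick(item, 'link');
		if (link) {
			feedCache.set(link.trim(), {
				title: pick(item, 'title'),
				date: pick(item, 'pubDate'),
				author: pick(item, 'dc:creator'),
			});
		}
	}
}

refreshFeed();                              // initial fill
setInterval(refreshFeed, 10 * 60 * 1000);   // re-poll every 10 minutes, as suggested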

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> Possibly! It's easiest in cases where The Wikipedia Library has an ongoing dialogue, as is the case with Elsevier, who we're currently talking with about this issue.
>
> This is partly a consequence of the fact that our traffic has increased a lot over the last few years; we didn't use to trigger IP blocks as often.
>
> A possible solution would be to close off the API, but that would mean we'd no longer support things like reftoolbar.
>
> We may also want to look into adding blacklists for websites that have expressed that they definitely do not want us accessing them, to be respectful of that.
>
> Could we set up a system whereby API keys are manually distributed for tools which are going to be used on Wikimedia projects? I'd hope we could find a middle ground between 'fully open' and 'fully closed'. Unless of course tools like reftoolbar are the primary culprit of this increased traffic.

We could model ourselves after the Crossref API: https://www.crossref.org/documentation/retrieve-metadata/rest-api/tips-for-using-the-crossref-rest-api/

The issue with having API keys and giving one to reftoolbar is that there is no way to store secrets on wiki! It's not private in the least. It would end up being security via obscurity, trusting that people either a) don't steal the publicly viewable key or b) only use the Toolforge service which holds the key. Which might be enough, really. But if we are still deliberately letting people use it for things other than on-wiki stuff, it's still going to be an issue.

> How about scraping RSS feeds every 10 minutes and caching the results? That's very little traffic. Includes titles, links, author and date.

I'm not sure if they'd prefer it, but it is possible we could start storing this metadata ourselves. For instance, there's been long-standing interest in this project using Wikidata, i.e. storing the metadata on Wikidata and citing it with Cite Q. Unfortunately it hasn't really gone anywhere... A first pass might be to see how it works for DOIs, since we have a lot of those on Wikidata currently, i.e. we could query Wikidata instead of Crossref (who very nicely let us get away with a lot of traffic on the "polite" config, and have for quite some time, even though we already have the data stored on Wikidata. They're fast, though!).

But for news sites, I'm not sure it helps much since we probably only cite a very small fraction of the total number of articles...
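
For reference, the Crossref "polite" pattern mentioned above boils down to identifying the client and a contact address on every request; a minimal sketch (the contact address and DOI below are placeholders):

// Sketch of a "polite" Crossref lookup: the mailto parameter (and/or a
// User-Agent carrying a contact address) routes traffic to Crossref's polite pool.
async function crossrefLookup(doi) {
	const res = await fetch(
		'https://api.crossref.org/works/' + encodeURIComponent(doi) + '?mailto=citoid-admins@example.org',
		{ headers: { 'User-Agent': 'citoid-sketch/0.1 (mailto:citoid-admins@example.org)' } }
	);
	const body = await res.json();
	return body.message; // Crossref wraps the work record in "message"
}

// crossrefLookup('10.xxxx/placeholder-doi').then(work => console.log(work.title));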

> Besides statistics about which tools generate this much volume, could you also share some kind of time-series dashboard with request volume and error volume? Thank you.

https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-30m&to=now

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> Possibly! It's easiest in cases where The Wikipedia Library has an ongoing dialogue, as is the case with Elsevier, who we're currently talking with about this issue.
>
> This is partly a consequence of the fact that our traffic has increased a lot over the last few years; we didn't use to trigger IP blocks as often.
>
> A possible solution would be to close off the API, but that would mean we'd no longer support things like reftoolbar.
>
> We may also want to look into adding blacklists for websites that have expressed that they definitely do not want us accessing them, to be respectful of that.
>
> Could we set up a system whereby API keys are manually distributed for tools which are going to be used on Wikimedia projects? I'd hope we could find a middle ground between 'fully open' and 'fully closed'. Unless of course tools like reftoolbar are the primary culprit of this increased traffic.
>
> We could model ourselves after the Crossref API: https://www.crossref.org/documentation/retrieve-metadata/rest-api/tips-for-using-the-crossref-rest-api/
>
> The issue with having API keys and giving one to reftoolbar is that there is no way to store secrets on wiki! It's not private in the least. It would end up being security via obscurity, trusting that people either a) don't steal the publicly viewable key or b) only use the Toolforge service which holds the key. Which might be enough, really. But if we are still deliberately letting people use it for things other than on-wiki stuff, it's still going to be an issue.

Ah yes, I'd forgotten that Reftoolbar is a script; I see how that complicates things.

> Besides statistics about which tools generate this much volume, could you also share some kind of time-series dashboard with request volume and error volume? Thank you.
>
> https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-30m&to=now

I see ~3 requests/second, which is presumably for all sites (NYT, Reuters, NPR, etc) combined. Let's be generous and say a single site is taking one third of the requests. Are organizations like Reuters seriously concerned by ONE request per second?

Note: the Editing Team is thinking through a couple of ways we could incrementally improve the current experience.

You can expect to see another comment from @Esanders, @Mvolz, or me before this week is over outlining those approaches.

> Note: the Editing Team is thinking through a couple of ways we could incrementally improve the current experience.
>
> You can expect to see another comment from @Esanders, @Mvolz, or me before this week is over outlining those approaches.

While the Editing Team has not yet converged on a set of potential solutions that could enable people to reliably use Citoid to generate references for major news websites, it has identified two short-term interventions that could improve the current experience...

Intervention | Ticket | Reference(s)
Revise the current error message to explicitly state why people are encountering it | T364594 | Current error message: CitoidErrorMessage-Current.png (318×458 px, 94 KB)
Leverage Edit Check infrastructure to offer people a call to action from within Citoid | T364595 | We're imagining something similar to the current Reference Reliability experience: image.png (1×2 px, 239 KB)

> The NYTimes has been blocking us for a while; it briefly worked when we changed datacenters and ergo IP, but they've understandably reblocked us after a few weeks' reprieve!
>
> There's not a whole lot we can do except to ask for IP exemptions - @Samwalton9-WMF would this be something partnerships could try?
>
> Possibly! It's easiest in cases where The Wikipedia Library has an ongoing dialogue, as is the case with Elsevier, who we're currently talking with about this issue.
>
> This is partly a consequence of the fact that our traffic has increased a lot over the last few years; we didn't use to trigger IP blocks as often.
>
> A possible solution would be to close off the API, but that would mean we'd no longer support things like reftoolbar.
>
> We may also want to look into adding blacklists for websites that have expressed that they definitely do not want us accessing them, to be respectful of that.
>
> Could we set up a system whereby API keys are manually distributed for tools which are going to be used on Wikimedia projects? I'd hope we could find a middle ground between 'fully open' and 'fully closed'. Unless of course tools like reftoolbar are the primary culprit of this increased traffic.
>
> We could model ourselves after the Crossref API: https://www.crossref.org/documentation/retrieve-metadata/rest-api/tips-for-using-the-crossref-rest-api/
>
> The issue with having API keys and giving one to reftoolbar is that there is no way to store secrets on wiki! It's not private in the least. It would end up being security via obscurity, trusting that people either a) don't steal the publicly viewable key or b) only use the Toolforge service which holds the key. Which might be enough, really. But if we are still deliberately letting people use it for things other than on-wiki stuff, it's still going to be an issue.

While it could help if the tool were available only to logged-in wiki users (aka editors) as a precaution, I have my doubts that this behavior is because of high traffic in all cases; see below.

> Besides statistics about which tools generate this much volume, could you also share some kind of time-series dashboard with request volume and error volume? Thank you.
>
> https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-30m&to=now
>
> I see ~3 requests/second, which is presumably for all sites (NYT, Reuters, NPR, etc) combined. Let's be generous and say a single site is taking one third of the requests. Are organizations like Reuters seriously concerned by ONE request per second?

I have the same question. Instead, I'd suggest that in some (if not many) cases Citoid is triggering some bot-detection mechanism (those have proliferated in the last few years) and is being served a CAPTCHA first by some CDN (e.g. Fastly), which it obviously can't (and shouldn't) solve. The best solution is probably to ask that it isn't classified as a bot.

Let me note that Citoid utilizes Zotero under the hood and we have very little visibility into what Zotero does. Zotero's Grafana dashboard is pretty empty, as Zotero doesn't expose metrics AFAIK. If 1 request to Citoid ends up being multiplied by 10 or 20 by Zotero, it could be that in some cases it triggers some high-traffic detection mechanism. I still find it hard to believe, though.

> Note: the Editing Team is thinking through a couple of ways we could incrementally improve the current experience.
>
> You can expect to see another comment from @Esanders, @Mvolz, or me before this week is over outlining those approaches.
>
> While the Editing Team has not yet converged on a set of potential solutions that could enable people to reliably use Citoid to generate references for major news websites, it has identified two short-term interventions that could improve the current experience...
>
> Intervention | Ticket | Reference(s)
> Revise the current error message to explicitly state why people are encountering it | T364594 | Current error message: CitoidErrorMessage-Current.png (318×458 px, 94 KB)
> Leverage Edit Check infrastructure to offer people a call to action from within Citoid | T364595 | We're imagining something similar to the current Reference Reliability experience: image.png (1×2 px, 239 KB)

In general it's imho preferable to invest time into just making it work instead of writing better error messages. Maybe something like this...

/*<nowiki>
This script is public domain, irrevocably released as WTFPL Version 2[www.wtfpl.net/about/] by its author, Alexis Jazz.
I can haz citation? Plz?

https://www.sciencedirect.com/science/article/abs/pii/S2468023024002402
https://www.npr.org/2024/03/19/1239528787/female-genital-mutilation-is-illegal-in-the-gambia-but-maybe-not-for-much-longer
https://www.reuters.com/world/africa/gambia-mp-defends-bid-legalise-female-genital-mutilation-2024-04-08/
https://www.nytimes.com/2024/04/11/us/politics/spirit-aerosystems-boeing-737-max.html
*/


hazc={};
async function getSiteSource(url, siteHTTP, siteHTTPtext, urlinfo, sitejson, first1, first2, first3, last1, last2, last3, date, accessdate, title, template, website, dateobj, articledate) {
	url=$('#hazcinput')[0].value;
	console.log('get '+url);
	hazc.siteHTTP = await fetch(url);
	hazc.siteHTTPtext = await hazc.siteHTTP.text();
	hazc.urlinfo=new mw.Uri(url);
	if ( hazc.urlinfo.host.match(/npr\.org/) ){

		sitejson=JSON.parse(hazc.siteHTTPtext.match(/NPR.serverVars = (.*);/)[1]);

		template='cite web';
		website='[[w:en:NPR|NPR]]';
		first1=sitejson.byline[0].match(/([^ ]*)/)[0];
		last1=sitejson.byline[0].match(/([^ ]*) (.*)/)[2];
		if ( sitejson.byline[1] ) {
			first2=sitejson.byline[1].match(/([^ ]*) (.*)/)[1];
			last2=sitejson.byline[1].match(/([^ ]*) (.*)/)[2];
		} else {
			first2='';
			last2='';
		}
		if ( sitejson.byline[2] ) {
			first3=sitejson.byline[2].match(/([^ ]*) (.*)/)[1];
			last3=sitejson.byline[2].match(/([^ ]*) (.*)/)[2];
		} else {
			first3='';
			last3='';
		}
		dateobj=new Date(sitejson.fullPubDate);
		articledate=dateobj.toISOString().replace(/T.*/,'');
		title=sitejson.title;
	} else if ( hazc.urlinfo.host.match(/nytimes\.com/) ) {
		mw.notify('NYtimes TODO, try NPR');
		return; // no scraper for this host yet, so don't emit an empty citation
	} else if ( hazc.urlinfo.host.match(/reuters\.com/) ) {
		mw.notify('Reuters TODO, try NPR');
		return;
	} else if ( hazc.urlinfo.host.match(/sciencedirect\.com/) ) {
		mw.notify('Sciencedirect TODO, try NPR');
		return;
	}
	// Build the citation template, including second/third authors when present
	mw.notify('{{'+template+'|url='+url+'|title='+title+'|website='+website+'|first1='+first1+'|last1='+last1+(first2?'|first2='+first2+'|last2='+last2:'')+(first3?'|first3='+first3+'|last3='+last3:'')+'|date='+articledate+'|access-date={{subst:#time: Y-m-d }}}}');

}

hazc.input=document.createElement('input');
hazc.input.id='hazcinput';
hazc.input.value='https://www.npr.org/2024/03/19/1239528787/female-genital-mutilation-is-illegal-in-the-gambia-but-maybe-not-for-much-longer';
hazc.input.size='50';
hazc.submit=document.createElement('button');
hazc.submit.id='hazcsubmit';
hazc.submit.innerText='I can haz citation?';
$('body').prepend(hazc.input,hazc.submit);

$('#hazcsubmit').on('click',function(){
	console.log('clicked');
	OO.ui.confirm('Ur privacy will be violated cookie jar etc etc').done(function(a){if(a){getSiteSource();}});
});
Mvolz renamed this task from Several major news websites (NYT, NPR, Reuters...) block citoid due to too much traffic to Several major news websites (NYT, NPR, Reuters...) block citoid. (Sat, May 11, 6:25 AM)

Per https://meta.wikimedia.org/wiki/OWID_Gadget, having our users perform the requests is a non-starter.

This is really weird, I don't understand. They would benefit from giving us the metadata; the metadata isn't a backdoor around their paywall and they don't benefit from hiding it. Are we just not talking to the right people there?

> Per https://meta.wikimedia.org/wiki/OWID_Gadget, having our users perform the requests is a non-starter.

What's wrong with the bookmarklet version (T362379#9729585)? It could spit out JSON to copy and paste into VE. Or it could make some kind of link to a Wikipedia page which then caches the citation locally, so the next time you insert with the same browser you can use the cache instead of regenerating from scratch.

Maybe it wouldn't get a ton of usage because it's a bit more involved to use, but it's better than nothing.

@akosiaris

In the Grafana dash, Saturation -> Total Network says it's about 10 MB/s. Does this count everything the job is doing, including what Zotero might be sending out?

I still think what you said makes sense - I've seen at least one bug where Zotero was stuck in a loop, so knowing what the egress from a single citoid request is would be useful. https://forums.zotero.org/discussion/102507/zotero-causing-continuous-article-download-request-loop

> @akosiaris
>
> In the Grafana dash, Saturation -> Total Network says it's about 10 MB/s. Does this count everything the job is doing, including what Zotero might be sending out?
>
> I still think what you said makes sense - I've seen at least one bug where Zotero was stuck in a loop, so knowing what the egress from a single citoid request is would be useful. https://forums.zotero.org/discussion/102507/zotero-causing-continuous-article-download-request-loop

In terms of getting stuck in redirect loops, we do actually test for that before we send anything to Zotero - that check is the source of at least one extra request (making at least two total requests, one each from Citoid and Zotero), to see if the resource is there before we pass it on. Not error-proof, though.

We do have logs of all outgoing requests: https://logstash.wikimedia.org/app/dashboards#/view/398f6990-dd56-11ee-9dd7-b17344f977e3?_g=h@c823129&_a=h@19b3870

Looks like there are about 2 Citoid requests for every 1 Zotero one; no obvious sign we're making a ton of requests to one resource or that Zotero is the culprit. One thing that might help is to make our Citoid user agent string more browsery, like Zotero's, for when we check for redirects, which might make us run into automated/algorithmic blocks less often.
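
A rough sketch of that last idea (illustrative only, not the actual Citoid code path; the UA string and function name are placeholders):

// Do the pre-Zotero "is the resource there?" check with a browser-like
// User-Agent that still identifies Citoid, to trip fewer automated blocks.
const BROWSERY_UA =
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' +
	'Chrome/124.0 Safari/537.36 Citoid (https://www.mediawiki.org/wiki/Citoid)';

async function checkResource(url) {
	const res = await fetch(url, {
		method: 'HEAD',          // cheap existence/redirect check, no body transferred
		redirect: 'follow',
		headers: { 'User-Agent': BROWSERY_UA },
	});
	return { finalUrl: res.url, status: res.status };
}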

> @akosiaris
>
> In the Grafana dash, Saturation -> Total Network says it's about 10 MB/s. Does this count everything the job is doing, including what Zotero might be sending out?

Zotero-wise:

Over the last 30 days it ranges around ~100 kB/s transmit and between 200 kB/s and 400 kB/s receive.

This does indeed count everything that Zotero is sending and receiving, including health checks. This sum does include traffic that Citoid sends to Zotero, but DOES NOT include traffic that Citoid sends to/receives from the world, other parts of the infrastructure, etc.

Citoid-wise:

Over the last 30 days it ranges up to ~150 kB/s for transmit and up to 10-15 MB/s receive, as you point out. This includes traffic that Citoid receives/sends, including traffic that it sends to/receives from Zotero, but similarly to the above it DOES NOT include traffic that Zotero sends to/receives from the world, other parts of the infrastructure, etc.

The discrepancy you note is big and intriguing. I can't attribute it to something specific. What I can say is that it's not something a specific instance does; a quick explore shows that traffic from all 8 instances of Citoid has similar patterns (which is good because it matches my expectations).

image.png (482×1 px, 201 KB)

@Mvolz correct me if I am wrong, but Citoid relies almost exclusively on Zotero for workloads and only does things itself if Zotero fails, right? I find it hard to believe that health checks could create so much traffic, hence my asking.

> I still think what you said makes sense - I've seen at least one bug where Zotero was stuck in a loop, so knowing what the egress from a single citoid request is would be useful. https://forums.zotero.org/discussion/102507/zotero-causing-continuous-article-download-request-loop

Agreed. But both egress and ingress. This 10MB/s traffic is weird.

> @Mvolz correct me if I am wrong, but Citoid relies almost exclusively on Zotero for workloads and only does things itself if Zotero fails, right? I find it hard to believe that health checks could create so much traffic, hence my asking.
>
> I still think what you said makes sense - I've seen at least one bug where Zotero was stuck in a loop, so knowing what the egress from a single citoid request is would be useful. https://forums.zotero.org/discussion/102507/zotero-causing-continuous-article-download-request-loop
>
> Agreed. But both egress and ingress. This 10MB/s traffic is weird.

My best guess is this is a PDF thing. They're big and Zotero rejects them (historically, trying to load them caused memory problems), after which point Citoid still tries, unsuccessfully, to scrape them... it's in the works to reject them in Citoid, too:

https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/1031870

If it is downloading PDFs, would that affect both ingress and egress?

Perhaps the problem is people trying to cite things like https://pdf.sciencedirectassets.com/271102/1-s2.0-S0014579310X00084/1-s2.0-S0014579309009831/main.pdf

In which case the fix is simple: we simply don't make the request at all if we see the extension .pdf (sometimes the extension is missing and we have no way of knowing, but it should get the traffic down). Zotero already avoids doing this, I think.

Probably not the problem for the other sites, which don't have PDFs.

EDIT: The way Zotero does this is to simply abort the request if it's getting too much data back: https://github.com/zotero/translation-server/pull/69
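
A sketch of that short-circuit (a hypothetical helper, not the Gerrit change itself):

// Refuse to fetch URLs whose path ends in .pdf; URLs without the extension
// will still slip through, but this should cut most of the large downloads.
function looksLikePdf(url) {
	try {
		return new URL(url).pathname.toLowerCase().endsWith('.pdf');
	} catch (e) {
		return false; // unparsable URL: let the normal request path handle it
	}
}

// looksLikePdf('https://pdf.sciencedirectassets.com/271102/1-s2.0-S0014579310X00084/1-s2.0-S0014579309009831/main.pdf') === true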

A quick iftop on one of the Citoid instances says that the bulk of this 10 MB/s traffic is from urldownloader1004, which is the current proxy that Citoid and Zotero (and all applications that want to reach the internet) use. So this is probably a result of Citoid requests directly to the outside.

I'll have a deeper look tomorrow.

> I'll have a deeper look tomorrow.

urldownloaders don't have the visibility needed to look at URLs, since most sites are accessed over HTTPS. So the only thing they do see is URL domains, not paths. We need Citoid itself to log the requests it makes.

> We do have logs of all outgoing requests: https://logstash.wikimedia.org/app/dashboards#/view/398f6990-dd56-11ee-9dd7-b17344f977e3?_g=h@c823129&_a=h@19b3870

This relies on the URL downloaders and suffers from exactly the same problem pointed out above. Btw, the top domain is accessed so much that it makes no sense that someone would try to cite it. Smells like abuse.

I've calculated the rate of outgoing traffic from the urldownloaders to both Citoid and Zotero based on response sizes, and it matches what we see in Grafana. So it's safe to say that this is almost entirely traffic that Citoid generates via requests it makes to the world.

I've also tried to find whether there is any pattern worth singling out in the visited domains, but the top 20 domains in the last 15 hours barely account for 15% of traffic in bytes (in requests, the single most visited domain accounts for ~15% of requests, but the responses are barely 7.5 KB).

But I think all of this is unrelated to the main issue this task is about. Even if incoming Citoid traffic is high (despite requests being low), this doesn't explain why various sites are not working for us.

The "ditch PDFs" thing is a good idea anyway; I'd say go ahead with it.

But overall, we need Citoid to log the errors (including the body) that the urldownloaders get from upstream sites.
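
A rough idea of what that logging could look like (the names and fields here are hypothetical, not existing Citoid code):

// When an upstream site refuses a request, record the status, a couple of
// interesting headers and a truncated body, so blocks and CAPTCHA/interstitial
// pages become visible in Logstash instead of just a generic failure.
async function fetchWithErrorLogging(url, logger) {
	const res = await fetch(url, { redirect: 'follow' });
	if (!res.ok) {
		const body = await res.text();
		logger.warn({
			event: 'upstream_error',
			url,
			status: res.status,
			server: res.headers.get('server'),
			body: body.slice(0, 2048), // truncate: interstitial pages can be large
		});
	}
	return res;
}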