Newspapers.com blocks citoid
Closed, Resolved · Public · BUG REPORT

Description

Since about 2026-03-12, citation fetching for Newspapers.com clipping links has been completely broken. It was working earlier this week when I did [[w:Star Channel (Canada)]]. The API says it gets a 404 from the site.

Given the volume of Newspapers.com citations I use, this is a high-priority issue on my end.

Logs

There was a big increase in requests over the last two weeks, after which we started getting 403 Forbidden responses. This looks like a deliberate block of citoid requests by newspapers.com, possibly because they detected the increased traffic from us.

Screenshot From 2026-03-17 12-55-30.png (434×959 px, 26 KB)

Event Timeline

My hunch is that this is some sort of new bot detection layer, as the OpenGraph validator reports the page title as "Just a moment...".
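A rough sketch of how that hunch could be checked against a fetched page: pull out the `<title>` and compare it with the interstitial title reported above. The names `TitleParser`, `CHALLENGE_TITLES` and `looks_like_challenge` are made up for illustration, and matching on this one title is an assumption, not a reliable bot-wall detector.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Assumption: the only marker checked is the title reported in this task.
CHALLENGE_TITLES = {"just a moment..."}


def looks_like_challenge(html: str) -> bool:
    """Return True if the page title matches a known interstitial title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip().lower() in CHALLENGE_TITLES
```

This only inspects HTML that has already been fetched; it does not say anything about why the site served that page.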

I have a sample clipping for use: https://www.newspapers.com/article/saint-john-times-globe-cable-tv-operator/192943609/

Probably similar to T362379 etc.

To confirm: https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.newspapers.com%2Farticle%2Fnatchez-gazette%2F192036290%2F?action=query&format=json returns {"error":"Unable to load URL https://www.newspapers.com/article/natchez-gazette/192036290/"}
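For anyone wanting to reproduce this, here is a minimal sketch of how the request path above is put together: the target URL is percent-encoded in full and appended to the `mediawiki` citation endpoint. The helper name `citoid_url` is made up for illustration, and the sketch only builds the URL; it does not send a request.

```python
from urllib.parse import quote

# Endpoint prefix taken from the URL quoted in this comment.
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"


def citoid_url(target: str) -> str:
    """Return the citoid REST URL for `target` (hypothetical helper)."""
    # safe="" so that ":" and "/" in the target are percent-encoded too.
    return CITOID_BASE + quote(target, safe="")


print(citoid_url("https://www.newspapers.com/article/natchez-gazette/192036290/"))
```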

So I'd suspect they're blocking traffic from citoid.

This is probably going to end up with us contacting them at some level. After all, they are a TWL partner. Maybe they can whitelist citoid.

Tagging @Samwalton9-WMF and @sjvipin

There was a big increase in requests over the last two weeks, after which we started getting 403 Forbidden responses. So it looks like this is possibly a traffic-related block.

Screenshot From 2026-03-17 12-55-30.png (434×959 px, 26 KB)

Mvolz renamed this task from "Newspapers.com URLs result in 404" to "Newspapers.com blocks citoid". Tue, Mar 17, 1:24 PM
Mvolz updated the task description.

We'll speak to our contacts; hopefully this will be a quick fix.

I had a quick look to see where the extra traffic is coming from, and it seems to be Toolforge, specifically https://link-dispenser.toolforge.org/

Tagging @Soda - is this an unusual level or kind of activity for your tool? (I.e., is this potentially abuse from a user, or just normal activity?)

Screenshot From 2026-03-17 16-23-04.png (434×959 px, 37 KB)

This amount of activity is more than 5 times what we normally get from all sources, and it's not just newspapers.com; it covers all websites.

This needs to be attached to the ticket.

Regardless of the activity of the source, perhaps it would be a good idea for link-dispenser to have some internal list of known-good sources that it doesn't bother checking?
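To make the suggestion concrete, a skip list could look roughly like the sketch below. Everything here is hypothetical: the names `KNOWN_GOOD_HOSTS` and `needs_check` and the example entry are invented for illustration and are not taken from link-dispenser's code or configuration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the entry is an example, not the tool's real config.
KNOWN_GOOD_HOSTS = {"newspapers.com"}


def needs_check(url: str) -> bool:
    """Return False when the URL's host is an allowlisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    on_allowlist = any(
        host == good or host.endswith("." + good) for good in KNOWN_GOOD_HOSTS
    )
    return not on_allowlist
```

Matching on the exact host or a dot-prefixed suffix means that `www.newspapers.com` is skipped while an unrelated host that merely ends in the same letters is still checked.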

This does seem like usual levels of activity. That being said, the tool does not require Auth at the moment. I'll implement that ASAP and see if that helps.

Can't see the image... but some of this should be expected traffic. The tool queries Citoid to resolve the title and determine whether a particular source is hallucinated. If the Citoid title and the title in the citation are dissimilar, then we flag it as "potentially LLM generated".
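As an illustration of the kind of comparison described above, here is a minimal sketch using Python's standard `difflib`. The metric, the `SIMILARITY_THRESHOLD` value and the function names are all assumptions made for this example; the comment does not say how link-dispenser actually measures similarity.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off; the tool's real metric and threshold are not stated.
SIMILARITY_THRESHOLD = 0.6


def normalize(title: str) -> str:
    """Lowercase the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_suspicious(citoid_title: str, cited_title: str,
                  threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Flag the citation when the two titles fall below the threshold."""
    return title_similarity(citoid_title, cited_title) < threshold
```

Each such comparison needs one Citoid lookup per citation, which is consistent with the tool generating a noticeable amount of Citoid traffic.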

Fixed, whoops! Thanks for looking into this.

The tool is behind OAuth2 now; in theory, unintended requests should drop off a cliff. I'll keep monitoring :)

Awesome! Thank you so much!