Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare
Closed, Resolved (Public)

Description

Prompted by @DLynch offline, this task involves the work of registering Citoid as a "friendly bot"/"verified bot" with Cloudflare. More information.

Requirements

Open questions

  • What teams – if any – might we need to consult with before submitting Citoid for Cloudflare review?

Event Timeline

Probably need to talk to SRE about the "verification method" section.

image.png (758×1 px, 88 KB)

Hi there, is there someone in ServiceOps new SRE who can help us with this?
Thank you in advance!

Someone in ServiceOps new probably knows the answer to this but I don't, at least not confidently. Here's a rough stab:

Citoid uses urldownloader so it might be sufficient to provide a list of the IP addresses for urldownloader[1003-1004,2003-2004].wikimedia.org. (In the future we'd need to update it occasionally, which would be a drag, or else generate that list of IP addresses on an HTTP endpoint somewhere for Cloudflare to scrape periodically, probably on noc.wm.o.)
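As a rough illustration of the "generate that list of IP addresses" idea, here is a minimal sketch that expands the bracketed host notation used above and resolves each name. The function names `expand_hosts` and `resolve_ips` are hypothetical, and whether these hostnames resolve from outside the production network is an assumption.

```python
import re
import socket


def expand_hosts(spec: str) -> list[str]:
    """Expand 'name[1003-1004,2003-2004].domain' into individual hostnames."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", spec)
    if match is None:
        return [spec]  # no bracketed range, nothing to expand
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number such as "1003"
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts


def resolve_ips(hosts: list[str]) -> list[str]:
    """Resolve each hostname and return the sorted, de-duplicated addresses."""
    addresses = set()
    for host in hosts:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
        # sockaddr[0] is the IP address for both IPv4 and IPv6.
        for info in socket.getaddrinfo(host, None):
            addresses.add(info[4][0])
    return sorted(addresses)
```

Calling `resolve_ips(expand_hosts("urldownloader[1003-1004,2003-2004].wikimedia.org"))` on a schedule and publishing the result would be one way to keep such a list current without hand-editing it.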

The only shared subdomain is wikimedia.org. If Cloudflare is willing to allow our whole domain via reverse DNS, that's definitely the easiest approach. (If I were Cloudflare, I would ask about user-provided workloads -- I'm not positive offhand that there are no Cloud workloads whose IP address reverse-resolves to wikimedia.org, but I think there should be none.) If not, maybe they can allowlist a glob like urldownloader*.wikimedia.org. If not, reverse DNS probably won't work.

Their blog post also mentions they might use UA + ASN in some cases -- if they'll do that for us, that works too (we're AS 14907). But Cloud traffic does come from the same AS as urldownloader, so it might not be a great option.

@akosiaris What am I missing?

What teams – if any – might we need to consult with before submitting Citoid for Cloudflare review?

Just for awareness, we also use Cloudflare's Magic Transit service for DDoS protection, so SRE has some contacts with them in the Cachebusting WG and in Netops, but I don't think that'll be relevant here and you don't need to check in with them or anything.

Hi,

Couple of notes here:

IP List

The blog says "These IPs must be publicly documented and exclusive to your bot. If you provide a shared IP address (like one used by a proxy service), our systems will detect risk and refuse to cooperate. We want to avoid accidentally allowing other traffic."

We definitely allow other "bots" to egress via urldownloader (our proxy service). There's at least

(not including Citoid and Zotero)

All of them good bots. I don't know how strict they are about this, but it's improbable we'll set up a dedicated instance per bot and go through this process for every bot.

That being said, if they are a bit lenient on this, this is an approach we can follow. We can give them the eqiad+codfw IP spaces, to allow ourselves the flexibility to do various maintenance work during upgrades et al. on the backing infrastructure of urldownloader.

rDNS

If they can accept a glob/regexp, I think we can agree to stick to urldownloader*.wikimedia.org. forever. This would indeed be pretty easy for everyone then. Specific hostnames, we probably want to avoid though.
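To make the glob idea concrete, here is a minimal sketch of how a hostname returned by a reverse lookup could be checked against such a pattern. It uses Python's standard `fnmatch` module; the helper name `matches_glob` is hypothetical, and nothing here reflects how Cloudflare actually evaluates patterns.

```python
from fnmatch import fnmatchcase

# Pattern taken from the discussion above.
PATTERN = "urldownloader*.wikimedia.org"


def matches_glob(hostname: str, pattern: str = PATTERN) -> bool:
    """Return True if a reverse-DNS hostname matches the glob pattern."""
    # PTR lookups often return a trailing dot and arbitrary case, so normalise.
    return fnmatchcase(hostname.rstrip(".").lower(), pattern)
```

Note that `*` in `fnmatch` also matches dots, so a name such as urldownloader.foo.wikimedia.org would match too; a stricter regexp would be needed to rule that out.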

User-Agent + ASN

This one has the issue that @RLazarus pointed out. WMCS traffic also originates from the same AS as production, and people run any kind of bot there.

@ppelberg @DLynch do you have any kind of contacts that we could ask for some clarification regarding the 2 questions above? Namely, whether the rDNS can be matched against a regexp/glob and whether we can have more than one bot from the same IP list.

Adding some more info: I went to https://dash.cloudflare.com/?to=/:account/:zone/security/bots with a personal free account I have, and of course there is no section to tell them about my bot as the blog suggests. Maybe an account with more privileges than a free account is required.

@akosiaris with the Wikimedia account we have, we do have access to the Add Verified Bot form and potentially we could fill that one in. The verification methods in that case are:

  • Reverse DNS
  • IP List
  • ASN

I have the screenshots of the full form, not sure if it's ok to paste them here as it's a public task and those are private pages on CF side.

akosiaris renamed this task from Register Citoid as a "friendly bot" with Cloudflare to Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare. Jul 24 2024, 10:48 AM
akosiaris updated the task description.

OK, thanks, I can see that too now.

I've been collecting information and tried to submit a bot, getting back this interesting validation error:

A reverse DNS must be a domain suffix in the format .subdomain.example.com

So, we can just have .wikimedia.org, apparently.
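For illustration, here is a minimal sketch of what a domain-suffix rule in the `.subdomain.example.com` format might mean for a hostname obtained from a reverse lookup. The helper `matches_suffix` is hypothetical and the matching semantics are an assumption based only on the validation message, not on Cloudflare's implementation.

```python
def matches_suffix(hostname: str, suffix: str) -> bool:
    """Return True if hostname ends with a dot-prefixed domain suffix."""
    if not suffix.startswith("."):
        raise ValueError("suffix must start with a dot, e.g. '.wikimedia.org'")
    # Normalise the trailing dot and case commonly seen in PTR records.
    host = hostname.rstrip(".").lower()
    return host.endswith(suffix.lower())
```

Under this reading, `.wikimedia.org` would accept any host under the domain, not just the urldownloader ones, which is why the earlier question about other workloads reverse-resolving to wikimedia.org matters.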

@ppelberg, @DLynch @zoe. The verified bot form requires entering some input we need your help on.

Easiest question is: which of the following categories that Cloudflare offers feels the best to you?

Academic Research
Accessibility
Advertising & Marketing
Aggregator
AI Crawler
Feed Fetcher
Monitoring & Analytics
Page Preview
Search Engine Crawler
Search Engine Optimization
Security
Social Media Marketing
Webhooks
Other

The next one would be: do we differentiate between Zotero and Citoid? As in, do we file a single request or 2? On our side both are valid approaches; the question is mostly whether you want Cloudflare to treat them like 1 entity or 2 different ones (and I admit I have no idea what that would mean).

Naming wise, I am gonna opt for prefixing the request(s) with Wikimedia to hopefully distinguish it from other Zotero/Citoid installations (I know Zotero is used in other venues too, and Citoid could well be too, since it's Open Source, even if I never heard of any). This is going to be public in https://radar.cloudflare.com/traffic/verified-bots

User-Agent wise, I assume for Citoid we are ok with the content https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1034860 and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/zotero/+/refs/heads/master/config/production.json5#6 for Zotero.

Note that they offer the ability to submit match patterns, not just a single string, so if you want to utilize that functionality, let us know.

Generally Citoid will make a couple of requests to resolve redirects and the like, and if it gets a 200 from a HEAD request it then asks Zotero to fetch the page and do its thing. If Zotero fails, Citoid will then make a further HEAD and GET request and run through its own, slightly less specialised, citation extraction routine.
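The flow described above can be sketched roughly as follows. This is not Citoid's actual code: the function and parameter names are hypothetical, the HTTP and extraction steps are passed in as callables, and returning `None` when the HEAD request is not a 200 is an assumption, since that case is not covered here.

```python
from typing import Callable, Optional


def resolve_citation(
    url: str,
    head: Callable[[str], int],
    zotero: Callable[[str], Optional[dict]],
    fallback: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Try Zotero first, then the less specialised fallback extraction."""
    # Step 1: only proceed if a HEAD request to the target returns 200.
    if head(url) != 200:
        return None  # assumption: behaviour for non-200 is not described
    # Step 2: ask Zotero to fetch the page and extract a citation.
    citation = zotero(url)
    if citation is not None:
        return citation
    # Step 3: Zotero failed, so run the fallback (its own HEAD and GET).
    return fallback(url)
```

From the target site's point of view, every one of these requests arrives from the same egress and with closely related User-Agent strings.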

From the perspective of the receiving end I think it'll all look like one bot.

One wrinkle is that I think some third parties do use our Citoid server. It's still for the rather benign purpose of generating citations, though. I imagine that this'll fall in the "academic research" or "other" categories depending?

OK, I've grouped them into 1, named Wikimedia Citation Bot, submitting the two different User-Agent headers from the links above as well as 2 match patterns that would always match, namely ZoteroTranslationServer/WMF and Citoid/WMF. I've also provided a link to https://www.mediawiki.org/wiki/Citoid, copying its first sentence into the short description field.
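As a quick illustration of what the two match patterns cover, here is a minimal sketch that treats them as plain substrings of the User-Agent header. How Cloudflare actually interprets match patterns is not stated in this task, so substring matching is an assumption, and the helper name is hypothetical.

```python
# The two match patterns submitted with the form.
MATCH_PATTERNS = ("ZoteroTranslationServer/WMF", "Citoid/WMF")


def is_wikimedia_citation_bot(user_agent: str) -> bool:
    """Return True if the User-Agent contains either submitted pattern."""
    return any(pattern in user_agent for pattern in MATCH_PATTERNS)
```

Because both patterns are fixed prefixes of the product token, the rest of either User-Agent string (contact details, version numbers) could change without breaking the match.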

I went for Other adding further information as Encyclopedic and/or Bibliographic Citation in the interest of avoiding any mess with https://en.wikipedia.org/wiki/Wikipedia:No_original_research.

I got back a "We have successfully received the Bot information you submitted. Thank you." and that's about it.

I should note I have 0 ways of knowing how the submission fares, when it will be reviewed, what the result will be, etc. The form just allows me to submit it, without any visible way to view, edit or know the status. Hopefully we should see it under https://radar.cloudflare.com/traffic/verified-bots at some point.

Switching to low as all we can do now is wait.

There's only 18 bots on the list at https://radar.cloudflare.com/traffic/verified-bots. Hopefully that isn't a sign of a slow or difficult application process.

18 10-item pages of bots. 178 in total. I fell into the same trap initially ;-). I am hoping the same, but I should note that it is evidently possible to submit that form using even a free Cloudflare account.

Update: I've resubmitted the bot and written to Cloudflare, asking them to review our submission. I'll update the task as soon as I hear back from them.

Wonderful – thank you, @joanna_borun.

I verified and citoid is now a verified bot.
https://radar.cloudflare.com/traffic/verified-bots - search for citoid.

Wooohooo. Thanks for taking over and handling it @joanna_borun

@ppelberg I think this is done, I'll resolve, but feel free to reopen.

Wooohooo. Thanks for taking over and handling it @joanna_borun

+1!

Wonderful work, @joanna_borun 👏🏼

@ppelberg I think this is done, I'll resolve, but feel free to reopen.

Agreed and thank you for all of the thought and attention you allocated to this, @akosiaris.

A screenshot for good measure...

image.png (902×1 px, 126 KB)