
Provide auth-less access to Enterprise APIs from WMF Analytics cluster
Open, Needs Triage, Public

Description

Wikimedia Enterprise APIs provide various resources, for example HTML dumps of Wikimedia sites. This resource used to be available via Dumps-Generation, but is not anymore; the only way to access this data now is via the API (docs).

The API allows unauthenticated requests from certain IPs (the list of IPs is at P82115), which covers Toolforge and Wikimedia Cloud in general. While this exemption makes it easier for the community to use Enterprise data, WMF work often happens on the internal Analytics cluster (stat* hosts), which is currently not included:

[urbanecm@stat1008 ~]$ curl https://api.enterprise.wikimedia.com/v2/snapshots
{"status":401,"message":"Unauthorized"}
[urbanecm@stat1008 ~]$

vs

urbanecm@bastion-eqiad1-03:~$ curl -s https://api.enterprise.wikimedia.com/v2/snapshots | jq . | head
[
  {
    "identifier": "aawiki_namespace_0",
    "version": "8a3f250fd3c79160e464212519e0b649",
    "date_modified": "2025-08-29T01:19:35.864325418Z",
    "is_part_of": {
      "identifier": "aawiki"
    },
    "in_language": {
      "identifier": "aa"
urbanecm@bastion-eqiad1-03:~$

To allow access from analytics hosts, we would likely need to allow access for install* hosts (or at the very least, install1004 and install1005). As far as I know, this means allowing access from the following IPs:

  • 208.80.154.74 (install1004)
  • 208.80.154.134 (install1005)
  • 208.80.153.105 (install2004)
  • 208.80.153.70 (install2005)
  • 185.15.59.3 (install3003)
  • 198.35.26.11 (install4003)
  • 103.102.166.11 (install5003)
  • 185.15.58.7 (install6003)
  • 195.200.68.100 (install7002)

I'm not sure if there is a way to express this as a single range. We specifically probably shouldn't allowlist all of our production-assigned ranges, as e.g. urldownloader makes requests on behalf of MediaWiki, which might result in unauthorised external requests (OTOH, MediaWiki allowing arbitrary HTTP calls for an unprivileged user is bad enough as it is, so maybe it doesn't matter?)
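For what it's worth, the listed IPs sit in several unrelated prefixes, so they do not merge into fewer allowlist entries at all. A quick check of this (a hypothetical helper using python3's `ipaddress` module, runnable on any stat host):

```shell
# Rough check of whether the install-host IPs can be merged into fewer
# allowlist entries. collapse_addresses only merges adjacent or
# overlapping networks, so the count it returns is the number of
# allowlist rules actually needed.
count_allowlist_entries() {
    python3 -c '
import ipaddress, sys
nets = [ipaddress.ip_network(a) for a in sys.argv[1:]]
print(len(list(ipaddress.collapse_addresses(nets))))
' "$@"
}

count_allowlist_entries \
    208.80.154.74 208.80.154.134 208.80.153.105 208.80.153.70 \
    185.15.59.3 198.35.26.11 103.102.166.11 185.15.58.7 195.200.68.100
# prints 9: every /32 stays separate
```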

Event Timeline

SRE: Would you mind confirming the appropriate IPs or ranges that would need to be allowlisted to include analytics cluster?
Wikimedia Enterprise: Would you mind doing the allowlist once the IPs are confirmed?

This task was created based on a discussion with @prabhat, see WMF Slack.

Please keep in mind that allowing the HTTP proxy IPs will ultimately allow Enterprise API access from all systems allowed to use the HTTP proxies, not only analytics/stats hosts. Also, these IPs/hosts might and will change in the future, so they would have to be updated regularly. Because of that, I would suggest checking whether there is a different way to authenticate the stats hosts.

Given that there are no stat hosts in datacenters other than eqiad, there is no need to allow IPs other than 208.80.154.74 / 2620:0:861:3:208:80:154:74 (for install1004.wikimedia.org).

Please keep in mind that allowing the HTTP proxy IPs will ultimately allow Enterprise API access from all systems allowed to use the HTTP proxies, not only analytics/stats hosts.

From my perspective, that is reasonable. We allow access to Wikimedia Cloud, and any production host should be in principle more trustworthy than anything in Wikimedia Cloud.

Also, these IPs/hosts might and will change in the future, so they would have to be updated regularly.

How often might that happen?

Because of that, I would suggest checking whether there is a different way to authenticate the stats hosts.

Do you have any thoughts on what that might be? Would puppetizing some kind of shared credentials (and possibly building a wrapper like analytics-mysql, so people do not need to remember where the credentials are and how to use them) be a better approach overall?
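To make the wrapper idea concrete, here is a minimal sketch of what such a tool could look like. The name `wme-api`, the credential path, and the variable names are all invented for illustration; the real locations would be decided in puppet:

```shell
#!/bin/bash
# Hypothetical "wme-api" wrapper, analogous to analytics-mysql: the
# credential lives in a puppet-managed file readable by analytics users,
# so nobody needs to remember where it is or how to use it. Path and
# names below are assumptions, not an existing setup.
set -eu

WME_CRED_FILE="${WME_CRED_FILE:-/etc/wikimedia-enterprise/token}"
WME_API_BASE="https://api.enterprise.wikimedia.com/v2"

# Build the full API URL for a resource path such as "snapshots".
wme_url() {
    printf '%s/%s' "$WME_API_BASE" "${1#/}"
}

# Perform an authenticated GET against the Enterprise API.
wme_get() {
    curl -s -H "Authorization: Bearer $(cat "$WME_CRED_FILE")" "$(wme_url "$1")"
}

# Example (not executed here): wme_get snapshots | jq . | head
```

A user on a stat host would then not need to know about the token at all, similar to how analytics-mysql hides the replica credentials.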

Given that there are no stat hosts in datacenters other than eqiad, there is no need to allow IPs other than 208.80.154.74 / 2620:0:861:3:208:80:154:74 (for install1004.wikimedia.org).

Ack, that makes sense. I also see install1005, is that not something we'd need to account for?

Also, these IPs/hosts might and will change in the future, so they would have to be updated regularly.

How often might that happen?

I can't say for sure. Definitely for every Debian OS version upgrade plus hardware refreshes.

Because of that, I would suggest checking whether there is a different way to authenticate the stats hosts.

Do you have any thoughts on what that might be? Would puppetizing some kind of shared credentials (and possibly building a wrapper like analytics-mysql, so people do not need to remember where the credentials are and how to use them) be a better approach overall?

I personally would prefer that over hardcoding some IPs.

Given that there are no stat hosts in datacenters other than eqiad, there is no need to allow IPs other than 208.80.154.74 / 2620:0:861:3:208:80:154:74 (for install1004.wikimedia.org).

Ack, that makes sense. I also see install1005, is that not something we'd need to account for?

Currently only install1004 has the HTTP proxy installed.

How often might that happen?

I can't say for sure. Definitely for every Debian OS version upgrade plus hardware refreshes.

These are VMs, so roughly every two years for Debian OS updates.

Ack, that makes sense. I also see install1005, is that not something we'd need to account for?

Currently only install1004 has the HTTP proxy installed.

install1004 is the current active proxy, and install1005 will assume its role over the next few weeks.

Reading the exchange above, I feel there is a need to get the IPs allowlisted, and I can forward that request; but I also see that other mechanisms are being explored as well. Are we ready to ask for IP allowlisting, or do we want to go another route to avoid requesting IP list changes?

Reading the exchange above, I feel there is a need to get the IPs allowlisted, and I can forward that request; but I also see that other mechanisms are being explored as well. Are we ready to ask for IP allowlisting, or do we want to go another route to avoid requesting IP list changes?

Is this already getting closer to a decision? Adding here that Wikimedia Deutschland's TechWish team would also benefit from being able to access the API from the analytics cluster without the need for extra authentication. Thanks for looking into this!

Hi @Urbanecm_WMF, @Urbanecm and @JMeybohm,

Could you let us know what additional information you need to make a decision on this issue? We’d like to start our work in our upcoming sprint next Tuesday and need to know whether we should look for an alternative solution or if you’ll be able to add the IPs to the allowlist.

We’re dependent on this decision because our previous method of accessing the data dumps no longer works and we need to update it. Allowlisting the IPs would definitely be the easier way for us.
The reason for accessing the data is to build and analyse our success metrics for Sub-referencing.

Hope to hear from you soon. Thanks a ton and let me know if you need any further information!

If we're talking about access from the analytics network, doesn't it make more sense to regularly import the datasets into HDFS, and deal with authentication in an import job?

If we're talking about access from the analytics network, doesn't it make more sense to regularly import the datasets into HDFS, and deal with authentication in an import job?

WME snapshots are coming from AWS, so I'm glad you asked this question.

We are about to start implementing pipelines that ingest Structured Content Dumps. As of now, they will run in Airflow, so accessing the data from HDFS would work for us, too.

Related: slack thread.

Reviewing this discussion shows there are three main options to consider. To facilitate decision-making, I'm going to summarize all of them below.


(A) Allowlist the install host IPs

Summary: Allow auth-less access to Enterprise APIs when proxying through the install hosts

Pros:

  • Trivial to implement
  • Already in place for community access (through Wikimedia Cloud / Toolforge), so IP changes need to be managed regardless, and the pattern is easy to discover for anyone familiar with it from elsewhere
  • Supports on-demand and realtime API, as well as snapshots

Cons:

  • The allowlist needs to be updated every time the install hosts get reimaged
  • Might result in storing duplicate data (=multiple users downloading the same snapshot)
  • Inconsistent with how other data are accessed within the analytics cluster
(B) Periodically import the dumps into HDFS

Summary: Create a job that downloads all snapshots from Wikimedia Enterprise and loads them into HDFS; the job still needs to authenticate to Enterprise, which can happen through credentials (option C) or by IP (option A).

Pros:

  • Stable usage of storage and transport capacities
  • Consistent access pattern with other data (XML dumps or events)

Cons:

  • Requires the job to be implemented
  • Requires new credentials to be maintained (or (A) to be implemented in parallel)
  • Does not allow usage of anything else besides snapshots
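The core of such an import job in (B) could be sketched as follows. The HDFS destination, the credential path, and the snapshot download URL are all assumptions for illustration (the URL shape should be verified against the Enterprise API docs); the real job would presumably run from Airflow:

```shell
#!/bin/bash
# Sketch of a hypothetical one-snapshot import step for option (B).
# Paths and the download URL are assumptions, not confirmed details.
# Assumes authentication is already solved via a token (option C) or an
# allowlisted egress IP (option A).
set -eu

HDFS_DEST="/wmf/data/raw/wikimedia_enterprise/snapshots"
API_BASE="https://api.enterprise.wikimedia.com/v2"
CRED_FILE="/etc/wikimedia-enterprise/token"

# Compose the (assumed) download URL for one snapshot identifier.
snapshot_url() {
    printf '%s/snapshots/%s/download' "$API_BASE" "$1"
}

# Fetch one snapshot to local staging, push it into HDFS, and remove the
# local copy, so each snapshot is downloaded and stored exactly once.
import_snapshot() {
    local id="$1" staging="/tmp/${1}.tar.gz"
    curl -s -H "Authorization: Bearer $(cat "$CRED_FILE")" \
        -o "$staging" "$(snapshot_url "$id")"
    hdfs dfs -mkdir -p "$HDFS_DEST"
    hdfs dfs -put -f "$staging" "$HDFS_DEST/"
    rm -f "$staging"
}

# Example (not executed here): import_snapshot aawiki_namespace_0
```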
(C) Expose Wikimedia Enterprise credentials within the analytics cluster

Summary: Put valid Wikimedia Enterprise credentials within the cluster, allowing everyone to authenticate requests using them

Pros:

  • Similar to accessing MediaWiki replicas
  • Supports on-demand and realtime API, as well as snapshots
  • Survives server reimages

Cons:

  • Relies on documentation for users to find tokens
  • Might result in storing duplicate data (=multiple users downloading the same snapshot)

Based on the summary, it seems that either (A) or (C) is the right next step. Considering the demand (evidenced on this ticket), it might make sense to go with implementation simplicity, and if duplicate data/reimages prove to be a problem in the future, we might revisit this decision. As far as I can see, this would have to be a Data-Engineering call, and I'm curious for their review of my summary above.

I'm expecting the Data-Engineering team to drive this discussion. Data-Platform-SRE can help with implementation once we know what we want to do.

I would be interested to see the code that creates the structured representation from the HTML of a page; can someone point me to the repository?

Based on the summary, it seems that either (A) or (C) is the right next step. Considering the demand (evidenced on this ticket), it might make sense to go with implementation simplicity, and if duplicate data/reimages prove to be a problem in the future, we might revisit this decision. As far as I can see, this would have to be a Data-Engineering call, and I'm curious for their review of my summary above.

Just to add my two cents: from my limited knowledge (so please correct me if I missed something), it seems that most folks are asking about access to the snapshots as opposed to the on-demand or realtime APIs. That would mean everyone's needs would be fully supported by downloading each snapshot once and making it generally available (option B). Limiting stat-machine analyses to the snapshot API makes sense in general, too: stat hosts generally aren't used for analysis that requires the most current data, and when we do need that, we usually build an official pipeline that does the API querying for us.

I've downloaded snapshots to the stat machines in the past, and these are huge files, so I'm always hesitant to delete them after download because of the cost of refetching them if I realize I made a mistake. It's happened before that I've had to sheepishly go and delete them after getting a nudge about stat machine hard drives filling up, with me as a prime suspect. Having the snapshots generally available via HDFS would also be helpful for lots of ad-hoc research/analyses. For instance, I queried the Parsoid APIs this week from the stat machines for some work that required HTML for several thousand articles, but would have happily just used the most recent snapshot. Almost any content analysis is going to be easier and more accurate when using HTML instead of wikitext.

I recognize that making this a formal data product incurs some additional overhead, but please let me know if we can help with making the case, as I suspect the benefits are worth it.