
NEW/CHANGE FEATURE REQUEST: Documentation for v1 Enterprise endpoint deprecation
Closed, Resolved · Public · Feature

Description

Data Platform Request Form

Is this a request for a:

  • Dataset
  • Data Pipeline
  • Data Feature

Is this a change to something existing:

  • Yes - please provide details of existing datasets/data pipelines (wiki links, Git URL, names of jobs, etc)
  • No

Please provide the description of your request:

Our team plans to deprecate the v1 Enterprise API endpoints this Monday (3/24/25) to fully support the new v2 endpoints and remove the v1 overhead.

This will be a breaking change for the HTML dumps that are currently generated on the 2nd and 21st of every month. Users can still access this content via the free version of the Enterprise APIs with an account, or via Wikimedia Cloud Services. Historical HTML dumps will continue to be available.

We need written updates on the following pages/subsections to ensure folks still have access to new content:

(1) https://dumps.wikimedia.org/ subsection: Static HTML dumps
Text (new):
A copy of all pages from all Wikipedia wikis, in HTML form.
These are currently not running, but Wikimedia Enterprise HTML dumps are provided for some wikis, and the historical versions remain available here. New versions can still be downloaded via the Enterprise APIs.

(2) https://dumps.wikimedia.org/other/enterprise_html/ (full page changes)
Text (new):

The partial mirrors of Wikimedia Enterprise HTML dumps were an experimental service; as of 03/24/2025, they are no longer replicated here.
If you need recent runs, dumps of article change updates, or the ability to query individual articles from the dumps, visit Wikimedia Enterprise to sign up for a free account. Alternatively, use your developer account to access the APIs within Wikimedia Cloud Services.

Historical Archive
The historical dumps will remain available here for public download, for a specific set of namespaces and wikis. Each dump output file is a tar.gz archive which, when uncompressed and untarred, contains a single file with one line per article, in JSON format. To view the payload schema, visit the Wikimedia Enterprise API Data Dictionary.
Accompanying the tar.gz file is a file containing the md5sum and the date the dump was produced, also in JSON format. All files for a dump run are placed in a single directory named in the format YYYYMMDD.

View those directories here: other/enterprise_html/runs
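
For illustration, a minimal sketch of reading articles out of one of these historical archives using Python's tarfile and json modules. The file name is a hypothetical example, and the field accessed is only one of the fields described in the data dictionary:

```python
# Minimal sketch: iterate over the articles in one historical Enterprise HTML
# dump archive. The file name below is a hypothetical example; substitute any
# archive downloaded from other/enterprise_html/runs.
import json
import tarfile

DUMP_PATH = "enwiki_namespace_0_0.tar.gz"  # hypothetical example file

with tarfile.open(DUMP_PATH, "r:gz") as tar:
    for member in tar:
        if not member.isfile():
            continue
        ndjson = tar.extractfile(member)
        # The archive contains a single file with one JSON object per line,
        # one line per article (see the Enterprise API Data Dictionary for the schema).
        for line in ndjson:
            article = json.loads(line)
            print(article.get("name"))  # "name" is one field from the schema; adjust as needed
            break  # remove this to process every article
```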

Use Case: (Please briefly explain what this feature will be used for):

We need written updates that reflect the current data environment to ensure folks still have access to the content they need.

Ideal Delivery Date:

By 3/28. The first missed update would happen on 4/2, so that is the absolute cut-off before the current language becomes obsolete.

Data Feature Checklist

Please link to the following if applicable.

| Document Type | Required? | Document/Link |
| --- | --- | --- |
| Related PHAB Tickets | N/A | <add link here> |
| Product One Pager | N/A | <add link here> |
| Product Requirements Document (PRD) | N/A | <add link here> |
| Product Roadmap | No | <add link here> |
| Product Planning/Business Case | No | <add link here> |
| Product Brief | No | <add link here> |
| Other Links | No | <add links here> |

Event Timeline

@Ahoelzl changes to the service or the site?


Closing, as I can see the changes reflected at https://dumps.wikimedia.org/other/enterprise_html/.

I was pointed to this ticket on T390839. I'd like to raise some concerns regarding the reduced accessibility and visibility of HTML dumps following the removal of the mirroring.

Getting the HTML dump is now a lot more involved. You need to create a separate account on the Enterprise API site (no unified login?), then get a token, then make a request to list the available snapshots, then make another request to download the dump (sketched below).

Compared to just downloading a single file from a well-known location.

It is also not clear to me how the API limits work (the UI shows 15 free snapshot requests – meaning I can download 15 snapshots per month?). It was also mentioned that requests issued from WMCS might not have this limit. Is this correct?
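
For reference, a rough sketch of that multi-step flow in Python with the requests library. The endpoint URLs, payload shapes, and field names here are assumptions drawn from the public Wikimedia Enterprise documentation, not verified against the live API; check the current API reference or the SDKs on GitHub before relying on them.

```python
# Rough sketch of the multi-step flow: log in, get a token, list snapshots,
# download one snapshot. Endpoint paths and payloads are ASSUMPTIONS and may
# not match the current Wikimedia Enterprise API; consult the official docs/SDKs.
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"  # assumed endpoint
API_BASE = "https://api.enterprise.wikimedia.com/v2"         # assumed base URL

# 1. Log in with the separate Enterprise account to obtain a bearer token.
resp = requests.post(AUTH_URL, json={"username": "USER", "password": "PASS"})
resp.raise_for_status()
token = resp.json()["access_token"]  # assumed field name
headers = {"Authorization": f"Bearer {token}"}

# 2. List the available snapshots.
snapshots = requests.get(f"{API_BASE}/snapshots", headers=headers)
snapshots.raise_for_status()

# 3. Download one snapshot; this is the step that counts against the monthly limit.
snapshot_id = "enwiki_namespace_0"  # hypothetical identifier
with requests.get(f"{API_BASE}/snapshots/{snapshot_id}/download",
                  headers=headers, stream=True) as dl:
    dl.raise_for_status()
    with open(f"{snapshot_id}.tar.gz", "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```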

Hey @jberkel

Yes, accessing these through the Enterprise APIs does require a separate login for now, and it is a little more involved than how it was on the WMF dumps site. Hopefully the SDKs available on GitHub make it easier to get started.

As for the limits and usage, your understanding is correct: you get 15 API requests, which means you can download 15 snapshots over the course of the month. And yes, the limits do not apply if you use WMCS.

Thanks, I've tried using the API/SDK, but it seems quite buggy (on the API side), see T393198 and T393203. In the end I was able to download the snapshots "unchunked", but that's not ideal for parallel processing.