Page MenuHomePhabricator

[SPIKE] Determine the best approach for surfacing 'Transform' endpoints in the REST Sandbox
Closed, ResolvedPublic8 Estimated Story Points

Description

Conditions of acceptance

  • Explore different options for how we could better handle 'transform' endpoints within the MW REST API. A few options are below:
    • Create a handler wrapper that flattens the routes to be declared definitively.
    • Change the underlying behavior of the endpoints so that from and to are not dynamic route elements.
    • OpenAPI generation override for these endpoints specifically.
  • Make a recommendation, including trade offs of each decision.
  • Create a PoC patch that demonstrates at least one of the different approaches.
  • Review the proposal with MWI and CTT teams for input prior to full implementation.
  • If additional changes are required, create follow up tickets.

Implementation details

NOTE: The Parsoid extension uses the same handler base classes.

Event Timeline

HCoplin-WMF updated the task description. (Show Details)
HCoplin-WMF set the point value for this task to 8.
HCoplin-WMF raised the priority of this task from Medium to High.Jun 17 2025, 3:22 PM

See the parent task, T391712: Transform "Try it out" doesn't work in REST Sandbox for a description of the issue we're trying to solve (as well as making this all more maintainable in the future).

OpenAPI spec generation in our system works on a a per-module basis. That is, each module has its own generated OpenAPI spec. The legacy routes defined in coreRoutes.json receiving special handling via ExtraRoutesModule. So even though routes in that file aren't in a proper module, they're treated somewhat as one for spec generation purposes.

Omitting a lot of details that are irrelevant for this task, the actual per-endpoint spec generation instantiates and initializes a Hander-derived class, then calls its getOpenApiSpec function.

In many cases, each endpoint/path corresponds to its own class derived from Handler. That class can therefore override the various functions in Handler related to OpenAPI specs in whatever way it needs to, in order to return correct information for the path it handles. In several cases (RevisionHTMLHandler, RevisionSourceHandler, SearchHandler, PageSourceHandler, PageHTMLHandler), one handler handles multiple paths. In all of these cases, the behavior across the paths is very similar or identical, making it reasonable to add at most a small amount of logic to the OpenAPI spec generation functions.

TransformHandler is more complicated. It handles nine different paths, and extends ParsoidHandler. ParsoidHandler has some opinions about path parameters and is extended by Parsoid. Also, Parsoid has its own TransformHandler that derives from the core TransformHandler. Much of this code predates the current spec generation system, including the changes to parameter validation we made as part of that. So there's a lot going on, and we have to be careful about breaking things.

A more sweeping change would be to introduce something like an Endpoint class and move at least some of the OpenAPI spec generation code there. Right now, our class design doesn't really match what we're trying to do. Handlers, by design, can service multiple paths. But OpenAPI specs are on a per-path basis. There isn't a PHP class corresponding to a path on a one-to-one basis, leaving us dealing with situations like this task on a case-by-case basis. A dedicated "Endpoint" class (or whatever name we pick) could make that more explicit in the code. But that's a bigger change, probably too big for this task.

Change #1174773 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/core@master] REST: produce proper OpenAPI specs for transform endpoints

https://gerrit.wikimedia.org/r/1174773

Observation: the Transform handler accepts a required "html" request body parameter, but allows it to be of multiple types. This is a complication we can work around, but will it is an uncommon situation and will therefore require an uncommon solution.

For example, these both return a 200 response with the response body "== Foo ==":

curl -X 'POST' \
  'https://test.wikipedia.org/w/rest.php/v1/transform/html/to/wikitext/' \
  -H 'accept: text/plain' \
  -H 'Content-Type: application/json' \
  -d '{
  "html": "<h2>Foo</h2>"
}'
curl -X 'POST' \
  'https://test.wikipedia.org/w/rest.php/v1/transform/html/to/wikitext/' \
  -H 'accept: text/plain' \
  -H 'Content-Type: application/json' \
  -d '{
  "html": {
          "headers": {
                  "content-type": "text/html;profile='https://www.mediawiki.org/wiki/Specs/HTML/2.4.0'"
          },
          "body": "<h2>Foo</h2>"
  }
}'

In the first command, the "html" request body parameter is a string, the second command it is a JSON object. The related code in TransformHandler includes the comment "Accept html as a string or object{body,headers}".

This doesn't play nicely with our normal getBodyParamSettings() function, which expects us to be able to assign a single PARAM_TYPE value to each body parameter. TransformHandler already recognizes if the "html" request body parameter is missing or malformed and returns an appropriate message, so no additional validation is necessary. We just have to make the generated OpenAPI spec include something acceptable, so that the REST Sandbox works.

There are a few ways we may be able to work around this without changes to TransformHandler. I'm trying out some possibilities.

Also worth noting that we were effectively faking optional path parameters until now, which allowed endpoints both with and without trailing slashes to work for public callers, for urls like:

/transform/html/to/wikitext
/transform/html/to/wikitext/

This affects six of the nine endpoints that will be dealt with in this change. Turnillo shows that the majority of callers are using the "without trailing slash" flavor, but there are still thousands of calls per month to the "with trailing slash" flavor. I will therefore explicitly add route definitions for both flavors, allowing us to do proper path parameter validation, and properly expose path parameters in the REST Sandbox, without breaking any callers.

This means that both flavors will show up in the REST Sandbox. If we find this undesirable, we could probably hack around it somehow. But I'd rather the REST Sandbox reflect reality, even if it is a little ugly.

Change #1174773 merged by jenkins-bot:

[mediawiki/core@master] REST: produce proper OpenAPI specs for transform endpoints

https://gerrit.wikimedia.org/r/1174773

The patch that was merged adds basic functional /transform endpoint behavior to the REST Sandbox. It does not make every possible optional parameter available (notably query parameters, but there may be others).

I suggest we consider the merged code as the PoC mentioned in the task description, and address the other parameters under parent task T391712: Transform "Try it out" doesn't work in REST Sandbox.

Also worth noting that we were effectively faking optional path parameters until now, which allowed endpoints both with and without trailing slashes to work for public callers, for urls like:

/transform/html/to/wikitext
/transform/html/to/wikitext/

This affects six of the nine endpoints that will be dealt with in this change. Turnillo shows that the majority of callers are using the "without trailing slash" flavor, but there are still thousands of calls per month to the "with trailing slash" flavor. I will therefore explicitly add route definitions for both flavors, allowing us to do proper path parameter validation, and properly expose path parameters in the REST Sandbox, without breaking any callers.

How did the requests with trailing slashes work before your patch? They were already routed to the gateway and to rest.php (e.g. not via restbase). Did they just get "did not match any known handler" errors?

Just chiming in that my recommendation would be that we only show the 'preferred' method in the Sandbox. It is confusing for end users to see duplicate endpoints without a clear reason for it. If we document both, it also increases the odds that folks will use both. I would suggest that take a more opinionated stance, and that we officially get the lesser used and less desired version on the path to deprecation.

Looking at the traffic in Turnilo, it's really not much. It also looks like effectively all of the requests coming in with a trailing slash but without an article title/revision following are resulting in errors: https://w.wiki/F7Mq
It's worth digging in a bit more, but on first glance I'm slightly dubious of how much of the trailing slash traffic is impactful, or even real.

How did the requests with trailing slashes work before your patch? They were already routed to the gateway and to rest.php (e.g. not via restbase). Did they just get "did not match any known handler" errors?

Refreshing myself on this after vacation. Hopefully the following analysis is correct.

The trailing slash routes did actually work prior to the change attached to this task.

The reason that six of the endpoints affected and not nine is that is is the affected routes also have a variation where there is a subsequent path parameter. In the trailing slash case, we invoked the route with the subsequent path parameter, detected in code that there wasn't an actual parameter there, and behaved as if the other version had been called. For example, we previously defined (only) the following routes for html to wikitext:

/v1/transform/html/to/wikitext
/v1/transform/html/to/wikitext/{title}
/v1/transform/html/to/wikitext/{title}/{revision}

If a caller requested /v1/transform/html/to/wikitext/, it actually invoked /v1/transform/html/to/wikitext/{title}. TransformHandler defines these path parameters as optional. This is possible due to a compromise I'm not completely comfortable with and which I think we should revisit someday. But not today.

Anyway, the presence of an optional path parameter in the handler, and the presence of a route with a path parameter after the trailing slash, combined to make the trailing slash routes work. That's what I meant when I said we were faking optional path parameters.

Just chiming in that my recommendation would be that we only show the 'preferred' method in the Sandbox. It is confusing for end users to see duplicate endpoints without a clear reason for it. If we document both, it also increases the odds that folks will use both. I would suggest that take a more opinionated stance, and that we officially get the lesser used and less desired version on the path to deprecation.

Looking at the traffic in Turnilo, it's really not much. It also looks like effectively all of the requests coming in with a trailing slash but without an article title/revision following are resulting in errors: https://w.wiki/F7Mq
It's worth digging in a bit more, but on first glance I'm slightly dubious of how much of the trailing slash traffic is impactful, or even real.

If we can just kill them, that'd be awesome. I was nervous about doing that in the initial patch, and concerned it'd be a barrier to getting it merged. Doing that in a separate patch would be cleaner, because that patch could be rolled back if it turns out we're wrong, without affecting any larger change.

If we're not confident we can immediately just remove them, then maybe this is an opportunity to work out a deprecation procedure and go through a redirection-with-deprecation period on them?

As for the option of keeping them but not displaying them in the REST Sandbox - there is probably a use case for hidden endpoints (Daniel's audience designation proposal anticipates this in some ways). I wouldn't object to introducing a mechanism to hide routes from the REST Sandbox. I would not be fond of doing it in a hacky way - I'd rather it be an actual mechanism that expresses intent in the code, in an easily discoverable way. And of course, there's no such thing as an actual "hidden" endpoint in open source code like this, nor should there be. The idea of an explicit sandbox suppression mechanism is precisely for the purpose of "hiding" something from general discoverability, because that's of benefit to both callers and us, while still being "open" in the code.

Task description says create a PoC, but we did a bit more and merged an actual patch to production.

That seems to complete the spike criteria (plus a little more). I'm going to close this task as Resolved. Any remaining sandbox issues related to these endpoints can be dealt with under the parent task and/or additional subtasks.