Page MenuHomePhabricator

'media' endpoint decommission postmortem and service lifecycle process enhancement
Closed, ResolvedPublic

Description

As discussed in T248553: `media` endpoint stopped working as before it appears the 'media' endpoint decommission resulted in breakage.

This is a task to represent work to do a simple postmortem and then review and enhance the service lifecycle process.

BACKGROUND

The original design of the Page Content Service ca. 2017 called for several API endpoints for retrieving page-related data. The goal of the project was to replace the legacy /page/mobile-sections endpoints with a more robust and fully-featured API suitable for adoption by a wide variety of clients. The planned PCS endpoints included /page/summary, which provided a summary of the most important page metadata and was used for supporting Page Previews; /page/mobile-html for providing pre-formatted page HTML content suitable for native app clients; endpoints for page-specific scripts and styles; and several supporting endpoints for retriving JSON-structured page metadata: /page/media for providing data about the media items used on a page, /page/references for providing structured reference content, and /page/metadata for providing miscellaneous metadata not already included elsewhere. In late 2017 and early 2018 most of the PCS endpoints were publicly launched, with their stability listed as experimental in the Wikimedia REST API documentation. The mobile-html endpoint proved to be a more complex undertaking than the others, and was not launched until 2019.

In mid- to late 2019, as the app teams began the work of adopting Page Content Service endpoints, the service endpoints began to evolve for various reasons from the original 2017 plan. It came to light that /page/media was impossible to invalidate correctly from cache without infrastructure support beyond the scope of the PCS project, so it was superseded by a more narrowly scoped /page/media-list endpoint. The page references were incorporated into the mobile-html content, rendering the /page/references endpoint unnecessary, and the various items included in /page/metadata were either incorporated into other endpoints or declared unnecessary. Hence, PI and CPT planned to remove the experimental /page/media, /page/references, and /page/metadata endpoints from the public REST API at a convenient time.

Meanwhile, unbeknownst to PI, the Inuka team began consuming the /page/media endpoint in the new KaiOS app.

In March 2020, PI and CPT removed the /page/media, /page/references, and /page/metadata endpoints from the REST API, declining to announce the change since the endpoints were experimental. Unfortunately, removing /page/media broke the prototype KaiOS app. This created additional work for the Inuka team but did not affect end users since the KaiOS app has not yet launched.

ISSUES PRESENTED

  • Visibility of endpoint stability: The stability of an API endpoint should be readily apparent and easy to find. Developers consuming an experimental, unstable, or beta endpoint should be aware that they are doing so.
  • Communication: More communication between the developer teams would likely have prevented any mistaken assumptions about the stability level or future of the various PCS endpoints. Of course, developers are not required to ask permission before consuming a public API, but discussing plans with the API's maintainers is advisable. Likewise, it is advisable for API maintainers to publicly advertise changes like removing a long-standing public API endpoint, even if that endpoint is advertised as experimental.

RECOMMENDATIONS
Product Infrastructure should take the following actions to prevent future incidents:

  • Improve API stability documentation:
    • The Page Content Service wiki page should advertise the stability of the public API endpoints. Where the stability is anything less than stable, the stability level should be featured prominently.
    • Link to the MediaWiki.org endpoint documentation from each OpenAPI / Swagger spec
    • Specify recommended email list for being notified of breaking changes from each OpenAPI / Swagger spec
    • Specify recommended email list for being notified of breaking changes from the MediaWiki.org pages
  • Announce all breaking API changes: As a courtesy, all breaking API changes should be announced on the recommended mailing list specified above at least two weeks in advance, even if the affected endpoints are marked experimental.

Event Timeline

Added a postmortem to the task description.

Thanks @Mholloway. Those are good ideas. Would any of the following work as well?

  • Link to the MediaWiki.org endpoint documentation from each OpenAPI / Swagger spec
  • Specify recommended email list for being notified of breaking changes from each OpenAPI / Swagger spec
  • Specify recommended email list for being notified of breaking changes from the MediaWiki.org pages

I've been wondering, is there a streamlined way to have coarse-grained observability of callers of the endpoints and a process to reach them more proactively in addition to wikitech-l?

Also, is there anything APIs and suggested client behavior could do in a consistent way to provide some sort of warning of impending decom or breaking changes so developers start seeing warnings in their console or their own observability systems?

Thanks @Mholloway. Those are good ideas. Would any of the following work as well?

  • Link to the MediaWiki.org endpoint documentation from each OpenAPI / Swagger spec
  • Specify recommended email list for being notified of breaking changes from each OpenAPI / Swagger spec
  • Specify recommended email list for being notified of breaking changes from the MediaWiki.org pages

Yep, those sound good. Added.

I've been wondering, is there a streamlined way to have coarse-grained observability of callers of the endpoints and a process to reach them more proactively in addition to wikitech-l?

I think we can get that info from Logstash. However, it's not currently part of the mobileapps Kibana dashboard.

Also, is there anything APIs and suggested client behavior could do in a consistent way to provide some sort of warning of impending decom or breaking changes so developers start seeing warnings in their console or their own observability systems?

I did some quick searches ytsterday to see if there was some standard or convention we could leverage for this, but wasn't able to find much beyond versioning. There is a pretty recent IETF draft specification for a Deprecation header which would be useful in situations like this. There's also an existing Warning header, but that's apparently soon-to-be-deprecated.

One other idea I want to put into the mix is that there should probably be a presumption that service endpoints "graduate" in stability to the next level after a certain period of time, if that stability hasn't otherwise been changed and the endpoint is still in service. Or some expectation that service maintainers review endpoint stability designations regularly. It's not really fair to consumers to leave endpoints with an "experimental" designation for years on end, even if they're supporting production apps.

The proposals sound good to me. @Pchelolo, do you have more ideas or comments? Is there already a mailing list for RESTBase services announcements? (I assume we don't use mediawiki-api-announce@lists.wikimedia.org for this.)

The proposals sound good to me. @Pchelolo, do you have more ideas or comments? Is there already a mailing list for RESTBase services announcements? (I assume we don't use mediawiki-api-announce@lists.wikimedia.org for this.)

Not really. I guess we can start, why not. It's an API, just like any other one.

The proposal is good for me as well.

TODO: Memorialize this on-wiki and ping Adam to send an email announcement

I want to clarify a couple of points:

  • Is everyone OK with wikitech-l and mediawiki-api-announce for announcements?
  • We probably don't want to announce all breaking changes to an experimental endpoint with two weeks' notice — only severe actions like removal. OK to update this policy to limit its scope to announcing the removal of an experimental endpoint?

And how about this:

  • API endpoints should be marked as deprecated as soon as the team agrees that they should be removed at some point in the future, even if there is no specific plan or deadline for their removal.

This would apply currently to the mobile-sections endpoints, for example.

OK to update this policy to limit its scope to announcing the removal of an experimental endpoint?

What about removal of properties in the response of a stable endpoint? Should that be included in announcements as well?

API endpoints should be marked as deprecated as soon as the team agrees that they should be removed at some point in the future, even if there is no specific plan or deadline for their removal.

Agreed. We should also mark deprecated field as such in the OpenApiSpec (deprecated: true).

OK to update this policy to limit its scope to announcing the removal of an experimental endpoint?

What about removal of properties in the response of a stable endpoint? Should that be included in announcements as well?

Ah, yes — I was assuming that to be the case, but it should be made perfectly clear. So I guess the policy would be something like:

Changes to all API endpoints will follow the endpoint stability guidelines for all Wikimedia APIs at a minimum. Further, the removal of public endpoints will be announced on wikitech-l and mediawiki-api-announce at least two weeks in advance of the removal, even if the affected endpoints are marked experimental.

Changes to all API endpoints will follow the endpoint stability guidelines for all Wikimedia APIs at a minimum. Further, the removal of public endpoints will be announced on wikitech-l and mediawiki-api-announce at least two weeks in advance of the removal, even if the affected endpoints are marked experimental.

This still has no direct statement about removal of fields. There's a hint of that in the linked document about stable endpoints but it's not spelled out what this actually means.

For the api_urls in summary case, does this mean that we can never remove this field? Or we can remove it when incrementing the major version number? Do we need to provide a way for clients to get the older version with the field added?

I guess I'm struggling then with how much detail we should put into our team-specific policy. Do we need a fully specified contract for each stability level? If so, maybe we should spend our time adding detail to the main endpoint stability policy, and just use it.

Also, after thinking about it some more, I'm somewhat concerned we're overreacting by promising advance warnings about breaking changes to experimental endpoints. If that's the case, there really is no such thing as an "experimental" stability level anymore, which I think we'd come to regret in time. The important thing is to make it hard for clients to miss when they're working with experimental endpoints, which is mostly a documentation problem.

  • API endpoints should be marked as deprecated as soon as the team agrees that they should be removed at some point in the future, even if there is no specific plan or deadline for their removal.

Agreed.

I guess I'm struggling then with how much detail we should put into our team-specific policy. Do we need a fully specified contract for each stability level?

A contract for every stability level is a good guideline to have, but not as part of the deprecation policy IMO this is out of scope for this ticket.

Also, after thinking about it some more, I'm somewhat concerned we're overreacting by promising advance warnings about breaking changes to experimental endpoints.

Also agree, it should be enough to announce just the deprecation of experimental endpoints.

This still has no direct statement about removal of fields. There's a hint of that in the linked document about stable endpoints but it's not spelled out what this actually means.

I guess this is more of a question about what is a breaking change than to have specificities to the policy, I think the two things are complementary but separate. Do we consider removal of fields a breaking change? What can we list as a breaking change for the APIs?

For the api_urls in summary case, does this mean that we can never remove this field? Or we can remove it when incrementing the major version number?

If we consider the removal of fields a breaking change, my understanding is that we can remove it, with 2 weeks of warning in advance since the summary endpoint is marked as stable and then increment the major version as is expected of semver.

Do we need to provide a way for clients to get the older version with the field added?

I guess this is technically not in our hands while restbase is proxying requests and defining the API versions.

Thanks @MSantos for the comments.

Do we consider removal of fields a breaking change?

Yes, I think any change that's not strictly additive would be considered a breaking change.

A contract for every stability level is a good guideline to have, but not as part of the deprecation policy IMO this is out of scope for this ticket.

One thought that occurred to me earlier (but I forgot to mention yesterday) is we could define a fuller policy of our own for each deprecation level, with the goal of "upstreaming" it to apply to the REST API in general. I don't like the idea of maintaining a separate, PI-specific set of stability guarantees over the long term.

@bearND, in T248570#6110205, are you asking for something like a language tweak (something like replacing "the removal of public endpoints" with "all breaking changes"), or for a fuller, more detailed statement of policy?

@Mholloway I think at that level changing it to "all breaking changes" is fine. The linked document should be the latter. It should be on a more granular level and should elaborate what kinds of breaking changes are handled in what way.

If we consider the removal of fields a breaking change, my understanding is that we can remove it, with 2 weeks of warning in advance since the summary endpoint is marked as stable and then increment the major version as is expected of semver.

Where is this 2 weeks taken from? 2 weeks of warning for a stable endpoint breakage seems very short to me.

Do we need to provide a way for clients to get the older version with the field added?

I guess this is technically not in our hands while restbase is proxying requests and defining the API versions.

We've made changes to RB in the past. An example of dealing with the change of the summary endpoint is actually still in the RB master branch, see summary.yaml vs summary_new.yaml. (That was a bit more difficult back then when it switched from a complete RB-doesn't use MCS/PCS implementation to one that uses MCS/PCS.)
We don't do this lightly, of course. It definitely reduces cache hits. Also I suspect some work to change some Varnish VCL might be required as well.

Where is this 2 weeks taken from? 2 weeks of warning for a stable endpoint breakage seems very short to me.

I was just conjecturing from Michael's example (see above). I agree it was confusing though, what I meant is that for stable endpoints we should commit to a specific announcement timeline before "removing the field from the response", a.k.a. breaking change, and just do it if there is no reasonable push-back from clients.

Changes to all API endpoints will follow the endpoint stability guidelines for all Wikimedia APIs at a minimum. Further, the removal of public endpoints will be announced on wikitech-l and mediawiki-api-announce at least two weeks in advance of the removal, even if the affected endpoints are marked experimental.

We've made changes to RB in the past. An example of dealing with the change of the summary endpoint is actually still in the RB master branch, see summary.yaml vs summary_new.yaml. (That was a bit more difficult back then when it switched from a complete RB-doesn't use MCS/PCS implementation to one that uses MCS/PCS.)
We don't do this lightly, of course. It definitely reduces cache hits. Also I suspect some work to change some Varnish VCL might be required as well.

That's nice to know, thanks for the info! I would be interested in discussing more on endpoint versioning and the availability of those versions. I just think that this is an external work to the product, even if we can work on it we are still depending on one team to facilitate the production environment for us, thus I think it's out of the scope of our team's policy.

I've updated the draft at https://www.mediawiki.org/wiki/Wikimedia_Product/Wikimedia_Product_Infrastructure_team/API_endpoint_deprecation_policy again. Please have a look and see if it's acceptable. Of course, if you have specific concerns or changes in mind, you can always edit it directly.

LGTM, but interested to hear thoughts of others.

I think we should also add a procedure for deprecating fields and endpoints. How about adding something like this?

  1. Team decides a field is deprecated.
  2. Team marks a field or endpoint as deprecated in the OpenAPI spec.
  3. Team member announces the breaking change. ...

My point here is mainly the addition for the 2nd step: mark fields (or endpoints) as deprecated in the OpenAPI spec. Fields can be marked as such using deprecated: true.

I think we should also add a procedure for deprecating fields and endpoints. How about adding something like this?

  1. Team decides a field is deprecated.
  2. Team marks a field or endpoint as deprecated in the OpenAPI spec.
  3. Team member announces the breaking change. ...

My point here is mainly the addition for the 2nd step: mark fields (or endpoints) as deprecated in the OpenAPI spec. Fields can be marked as such using deprecated: true.

I added this and complemented with some extra info in the draft https://www.mediawiki.org/wiki/Wikimedia_Product/Wikimedia_Product_Infrastructure_team/API_endpoint_stability_policy#Deprecated

I adjusted the verbiage to be a bit more general and allow some flexibility in how to handle the rare cases where deprecation gets held up.

Thanks, @dr0ptp4kt, I'm moving it to Code Review because the proposal seem solid enough to a last round of review before publishing it.