RESTBase and domain renames
Closed, Declined · Public

Description

Domain renames (redirects, really) do happen from time to time (the latest example being T31919: Move the wiki of WMEE). In such instances, the new domain is set up on the ops/MW side (DB, Apache, etc) and the old domain is then redirected to the new one in Apache. We need to define the set of events / changes needed in order to fully comply with the rename/redirect.

The obvious thing that is needed on the RESTBase side is adding the new domain to the configuration. But, the real question is - what to do with the old domain config stanza and how to keep the historical data accessible?

Option 1: Remove the old domain

The (somewhat) naive approach would be to simply remove the entry. Alas, the decision to route traffic to RESTBase is made in Varnish, i.e. before the request even reaches Apache, which means that removing the domain effectively hides the data present in RESTBase whenever a direct call is made to https://old-domain.wp.org/api/rest_v1/.... Moreover, requests to https://new-domain.wp.org/api/rest_v1/... would not return the data already present in storage; it would have to be recomputed, voiding RESTBase's promise of keeping historical data accessible. To mitigate the issue, we could (manually) update the records in storage to reflect the change, but that still does not give clients stable, reliable URIs, as removing a domain causes all requests to RESTBase for that domain to return 404s.

Option 2: Leave both domains

A direct alternative to option 1 is having both domains in RESTBase. This would allow us to keep serving the old data while collecting and storing updates for the new domain and its data as they happen. Moreover, apart from a small config change adding the new domain, no other intervention is needed on our side (which is a Good Thing). However, this approach does not scale, as it may leave us with two exact copies of the same data in storage. As an example, consider the following sequence of events:

  1. (A lot of) data being stored for domain a.wp.org
  2. a.wp.org is renamed to b.wp.org (so from now on updates will be coming for b.wp.org)
  3. A client accesses the latest revision of an article (for which there was a new revision after the rename) via https://a.wp.org/api/rest_v1/
  4. RESTBase will realise it hasn't got the latest revision and will fetch it from the MW API asking for a.wp.org
  5. Apache will happily rewrite a.wp.org as b.wp.org and serve the revision information back to RESTBase
  6. RESTBase will store that info and ask Parsoid (and other related back-end services) to render the content
  7. Because of the Apache rewrite, everybody will be happy to do so and deliver the content to RESTBase, which stores it

Now, imagine a dump script or something similar doing this for every article for both a.wp.org and b.wp.org - we end up with two exact copies without ever realising they are, in fact, copies of each other.

Additionally, a request to https://b.wp.org/api/rest_v1/... for data that exists in storage only under a.wp.org would not return the stored version; the content would be re-rendered instead.
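
The duplication described above can be illustrated with a small sketch. Everything in it is a hypothetical stand-in (an in-memory Map instead of RESTBase storage, a renderFromBackend function instead of the MW API and Parsoid behind Apache); it only shows why keying storage by the requested domain ends up with two identical entries under option 2.

```javascript
// Hypothetical in-memory stand-in for RESTBase storage, keyed by domain/title.
const storage = new Map();

// Stand-in for the MW API / Parsoid behind Apache: because of the redirect,
// a request for a.wp.org yields exactly what b.wp.org would yield.
function renderFromBackend(domain, title) {
  const canonical = domain === 'a.wp.org' ? 'b.wp.org' : domain;
  return `<html>${title} as rendered for ${canonical}</html>`;
}

// Option 2: every configured domain is treated independently.
function getPage(domain, title) {
  const key = `${domain}/${title}`;
  if (!storage.has(key)) {
    // Nothing stored under this domain yet, so fetch, render and store.
    storage.set(key, renderFromBackend(domain, title));
  }
  return storage.get(key);
}

// A dump script walking both domains stores the same content twice.
getPage('a.wp.org', 'Foo');
getPage('b.wp.org', 'Foo');
console.log(storage.size);
```

Nothing in the storage layer can tell that the two entries are copies of each other, since the domain is part of the key.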

Option 3: Handle rewrites

Another option would be to make RESTBase redirect-aware, so that it performs (internally) more or less the same domain-name manipulation as Apache does. This could be indicated in the configuration file with a simple redirect stanza:

/{domain:a.wp.org}:
  x-redirect-to: b.wp.org
/{domain:b.wp.org}: *wp/default/1.0.0

This stanza would instruct RESTBase to rewrite the domain to b.wp.org whenever request.params.domain === 'a.wp.org', which mitigates the double-storage problem of option 2 while providing stable URIs. In order to be able to expose old data as well, an additional check should be made during start-up that looks for data associated with a.wp.org and updates it to b.wp.org if such data is found.
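
A minimal sketch of the rewrite step, assuming the stanza has already been parsed into a plain object. The object shape and the function names resolveDomain and rewriteRequest are hypothetical and not part of RESTBase; the loop guard is an addition that the proposal itself does not mention.

```javascript
// Hypothetical parsed form of the redirect stanza.
const domainConfig = {
  'a.wp.org': { 'x-redirect-to': 'b.wp.org' },
  'b.wp.org': { spec: 'wp/default/1.0.0' }
};

// Follow x-redirect-to entries until a non-redirected domain is reached.
function resolveDomain(config, domain) {
  const seen = new Set();
  let current = domain;
  while (config[current] && config[current]['x-redirect-to']) {
    if (seen.has(current)) {
      throw new Error(`Redirect loop detected at ${current}`);
    }
    seen.add(current);
    current = config[current]['x-redirect-to'];
  }
  return current;
}

// Return a copy of the request with the domain parameter rewritten.
function rewriteRequest(config, req) {
  const domain = resolveDomain(config, req.params.domain);
  return { ...req, params: { ...req.params, domain } };
}

const rewritten = rewriteRequest(domainConfig, {
  params: { domain: 'a.wp.org', title: 'Foo' }
});
console.log(rewritten.params.domain);
```

Domains without a redirect entry, including ones missing from the config altogether, pass through unchanged.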

Discussion

Let's discuss! What do you think?

Option 1 shouldn't be considered as it actually lacks correctness wrt RESTBase's goals. Option 2 is correct, but might lead to unnecessary storage issues in the long run. I personally feel option 3 is the way to go. A concern there might be start-up performance when there are redirected domains. We might want to expand x-redirect-to and perhaps include a flag indicating whether to check the storage on start-up:

/{domain:a.wp.org}:
  x-redirect:
    to-domain: b.wp.org
    migrate_data: false

To be on the safe side, migrate_data should default to true, but this can be changed to false as soon as the first worker finishes its start-up process.
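
As a sketch of the defaulting rule, the expanded stanza could be normalised as below. The function name normalizeRedirect and the returned shape are hypothetical; only the keys x-redirect, to-domain and migrate_data come from the proposal above.

```javascript
// Normalise a (hypothetically parsed) domain stanza into redirect settings.
// Returns null when the stanza does not describe a redirect.
function normalizeRedirect(stanza) {
  const redirect = stanza && stanza['x-redirect'];
  if (!redirect) {
    return null;
  }
  if (typeof redirect['to-domain'] !== 'string') {
    throw new Error('x-redirect requires a to-domain string');
  }
  return {
    toDomain: redirect['to-domain'],
    // Safe default: migrate unless the flag is explicitly set to false.
    migrateData: redirect.migrate_data !== false
  };
}

console.log(normalizeRedirect({ 'x-redirect': { 'to-domain': 'b.wp.org' } }));
```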

Open question: do we need to think about situations where a.wp.org stops being redirected to b.wp.org and becomes a (new) wiki of its own? Option 3 allows that (after the data has been migrated to the new domain), but with that we lose URI stability. To be fair, though, so does MW at the same time.

Event Timeline

mobrovac raised the priority of this task from to Medium.
mobrovac updated the task description.

@mobrovac said:

I personally feel option 3 is the way to go.

I agree, with one But.

...looks for data associated with a.wp.org and updates it to b.wp.org if such data is found.

Where the domain is a component of the partition key, this means completely rewriting the affected rows, no?

Could we maybe just rewrite all the way down into storage (in perpetuity of course)?

Where the domain is a component of the partition key, this means completely rewriting the affected rows, no?

Yup.

Could we maybe just rewrite all the way down into storage (in perpetuity of course)?

I was contemplating something similar. As in:

  1. if domain is not a redirect, go to step 5
  2. check storage for the original domain
  3. if 200, return that
  4. rewrite the domain
  5. go down the usual path
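
The five steps above can be sketched as follows. The names (handleRequest, fetchUsual) and the Map-based storage are hypothetical stand-ins, and the real request path would be asynchronous; this only pins down the control flow.

```javascript
// redirects: Map of old domain -> new domain (hypothetical parsed config).
// storage:   Map keyed by "domain/title" (stand-in for the storage module).
// fetchUsual: stand-in for the normal RESTBase path (storage, then back-ends).
function handleRequest(redirects, storage, fetchUsual, domain, title) {
  if (redirects.has(domain)) {                          // step 1
    const stored = storage.get(`${domain}/${title}`);   // step 2
    if (stored !== undefined) {
      return stored;                                    // step 3
    }
    domain = redirects.get(domain);                     // step 4
  }
  return fetchUsual(domain, title);                     // step 5
}

const redirects = new Map([['a.wp.org', 'b.wp.org']]);
const storage = new Map([['a.wp.org/Old', 'stored under a.wp.org']]);
const fetchUsual = (domain, title) => `usual path for ${domain}/${title}`;

console.log(handleRequest(redirects, storage, fetchUsual, 'a.wp.org', 'Old'));
console.log(handleRequest(redirects, storage, fetchUsual, 'a.wp.org', 'New'));
```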

In order to be able to serve historic content, though, we'd need the inverse to work too. As in, if the config says:

/{domain:a.wp.org}:
  x-redirect:
    to-domain: b.wp.org

And a request for https://b.wp.org/api/rest_v1/... comes in, then on a 404 from storage we'd need to check the storage for a.wp.org as well. And we'd need to do it in each module. Unless, perhaps, we could offload that to the storage module, but I fear that might mean re-implementing the logic in various storage modules.
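
A rough sketch of that inverse lookup, assuming it lives in one place rather than in each module. The names buildReverseMap and getWithFallback, the Map-based storage and the response shape are all hypothetical.

```javascript
// Invert the redirect config: new domain -> list of old domains.
function buildReverseMap(redirects) {
  const reverse = new Map();
  for (const [from, to] of redirects) {
    if (!reverse.has(to)) {
      reverse.set(to, []);
    }
    reverse.get(to).push(from);
  }
  return reverse;
}

// Try the requested domain first; on a miss, try the domains redirecting to it.
function getWithFallback(storage, reverse, domain, title) {
  const candidates = [domain, ...(reverse.get(domain) || [])];
  for (const candidate of candidates) {
    const body = storage.get(`${candidate}/${title}`);
    if (body !== undefined) {
      return { status: 200, body, servedFrom: candidate };
    }
  }
  return { status: 404 };
}

const reverse = buildReverseMap(new Map([['a.wp.org', 'b.wp.org']]));
const store = new Map([['a.wp.org/Old', 'historic content']]);
console.log(getWithFallback(store, reverse, 'b.wp.org', 'Old'));
```

Every miss on a redirect target now costs an extra storage read per old domain, which is part of the overhead being weighed here.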

Finally, a conceptual problem I'm having with this approach is that of data consistency. Making redirects on-the-fly (by either RESTBase or its storage module) means tying the data to RESTBase's config, which is a path I'd reluctantly take.

[ ... ]

Could we maybe just rewrite all the way down into storage (in perpetuity of course)?

I was contemplating something similar. As in:

  1. if domain is not a redirect, go to step 5
  2. check storage for the original domain
  3. if 200, return that
  4. rewrite the domain
  5. go down the usual path

In order to be able to serve historic content, though, we'd need the inverse to work too. As in, if the config says:

/{domain:a.wp.org}:
  x-redirect:
    to-domain: b.wp.org

And a request for https://b.wp.org/api/rest_v1/... comes in, then on a 404 from storage we'd need to check the storage for a.wp.org as well. And we'd need to do it in each module. Unless, perhaps, we could offload that to the storage module, but I fear that might mean re-implementing the logic in various storage modules.

I guess what I was thinking was that once you had provisioned a.wp.org, then that wiki would forever be a.wp.org insofar as the data is keyed. If a rename occurred, we'd create a redirect, mapping b.wp.org -> a.wp.org (and back-again for result sets, as necessary). Essentially, you'd be making b.wp.org an alias to a.wp.org. Obviously this isn't ideal either, but I think these domains are relatively stable, that a rename is an exceptional event, and that we wouldn't accumulate too many of them.
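
To make the alias idea concrete, here is a minimal sketch in which data stays keyed by a.wp.org and b.wp.org is only a name in front of it. The aliases table, the row shape and the function names are hypothetical.

```javascript
// Hypothetical alias table from config: public (new) domain -> storage domain.
const aliases = new Map([['b.wp.org', 'a.wp.org']]);

// Rows remain keyed by the domain the wiki was originally provisioned under.
const rows = [
  { domain: 'a.wp.org', title: 'Foo', rev: 1 },
  { domain: 'a.wp.org', title: 'Foo', rev: 2 }
];

// Map the requested domain to the one used as the storage key.
function toStorageDomain(domain) {
  return aliases.get(domain) || domain;
}

// Query by storage domain, then map the domain back in the result set
// so that clients only ever see the domain they asked for.
function listRevisions(requestDomain, title) {
  const storageDomain = toStorageDomain(requestDomain);
  return rows
    .filter((row) => row.domain === storageDomain && row.title === title)
    .map((row) => ({ ...row, domain: requestDomain }));
}

console.log(listRevisions('b.wp.org', 'Foo'));
```

No stored row is rewritten in this variant; the cost is that the alias table has to be kept for as long as the data exists.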

Finally, a conceptual problem I'm having with this approach is that of data consistency. Making redirects on-the-fly (by either RESTBase or its storage module) means tying the data to RESTBase's config, which is a path I'd reluctantly take.

In this case, it would tie the mapping of new domain to old; the "alias" would be bound to the config, yes.

My vote would be to shelve this until we actually have a need to preserve old data for a significant project across a rename. For the small projects that were renamed so far option 1) wasn't an issue really, and big projects are a lot less likely to be renamed.

Maybe we could also come up with a more general read-only fall-back mechanism / strategy as well, as that could help with breaking schema migrations in general (not just domain fall-backs). But, this doesn't feel that high priority either, as we don't have any such changes coming up right now.

GWicke lowered the priority of this task from Medium to Low. Oct 4 2016, 7:56 PM
GWicke set Security to None.

Two years later, this has not been an issue in practice. I am going to be bold & decline this task for now. We can resurrect it if needed at a later point.