RESTBase support for www.wikimedia.org missing
Open, Medium, Public

Description

The canonical domain is www.wikimedia.org; wikimedia.org is generally only a redirect.

However, for some reason the RESTBase API on that domain uses "wikimedia.org", because the RESTBase Varnish override happens before Apache can normalise the URL.

https://wikimedia.org/ -> https://www.wikimedia.org/ -> 200 OK
https://wikimedia.org/api/ -> https://www.wikimedia.org/api/ -> 200 OK

https://wikimedia.org/api/rest_v1/?doc -> 200 OK
https://www.wikimedia.org/api/rest_v1/?doc -> 404 Not Found
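
The observations above can be written down as a small redirect tracer. This is only a sketch: the `OBSERVED` table encodes the results reported in this description rather than performing live requests, the 301 status for the redirects is an assumption (the description only says they redirect), and `trace` and `fake_fetch` are hypothetical helper names.

```python
from typing import Callable, List, Optional, Tuple

# (status, Location target or None). Results copied from the task description;
# the 301 status code is an assumption, the description only shows "->".
Response = Tuple[int, Optional[str]]

OBSERVED = {
    "https://wikimedia.org/": (301, "https://www.wikimedia.org/"),
    "https://www.wikimedia.org/": (200, None),
    "https://wikimedia.org/api/": (301, "https://www.wikimedia.org/api/"),
    "https://www.wikimedia.org/api/": (200, None),
    "https://wikimedia.org/api/rest_v1/?doc": (200, None),
    "https://www.wikimedia.org/api/rest_v1/?doc": (404, None),
}


def fake_fetch(url: str) -> Response:
    """Stand-in for an HTTP client: looks the URL up in OBSERVED."""
    return OBSERVED[url]


def trace(url: str, fetch: Callable[[str], Response],
          max_hops: int = 5) -> List[Tuple[str, int]]:
    """Follow redirects and return the visited (url, status) pairs."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if location is None or status not in (301, 302, 307, 308):
            break
        url = location
    return hops


if __name__ == "__main__":
    for start in ("https://wikimedia.org/api/",
                  "https://wikimedia.org/api/rest_v1/?doc",
                  "https://www.wikimedia.org/api/rest_v1/?doc"):
        chain = trace(start, fake_fetch)
        print(" -> ".join(f"{u} [{s}]" for u, s in chain))
```

Swapping `fake_fetch` for a real HTTP client that does not follow redirects automatically would turn this into a live check.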

It would make a lot more sense if at the very least this redirected. But more logical would be for it to be the other way around.

Event Timeline

Restricted Application added subscribers: TerraCodes, Aklapper.

In the case of RESTBase, these are actually two distinct domains for us. We are using wikimedia.org as a sort-of global domain which exposes content that pertains to WM in general and is not tied to a specific project. If you compare the docs for wikimedia.org and en.wikipedia.org you'll notice that the endpoints are actually different.

So, it's not that wikimedia.org should redirect to www.wikimedia.org but rather that RESTBase support for www.wikimedia.org is missing.

mobrovac renamed this task from RESTBase for wikimedia.org should be on www.wikimedia.org to RESTBase support for www.wikimedia.org missing. Jul 12 2016, 1:53 PM

Technically, I don't think there is a wiki at www.wikimedia.org, so a REST API wouldn't make a lot of sense. That said, people widely expect www.wikimedia.org to work the same as wikimedia.org, so perhaps we could make an exception here.

Do we ever expect www.wikimedia.org to become a regular wiki? If not, then I'd say let's just treat it as an alias for wikimedia.org, perhaps by stripping the www. prefix in Varnish.

GWicke triaged this task as Medium priority. Oct 12 2016, 9:31 PM
GWicke edited projects, added Services (next); removed Services.

> Do we ever expect www.wikimedia.org to become a regular wiki?

Unlikely indeed. But in the event that it does, I'm certain wikimedia.org would be a redirect to it. (Just as https://wikimedia.org is a redirect to https://www.wikimedia.org now).

> In the case of RESTBase, these are actually two distinct domains for us. We are using wikimedia.org as a sort-of global domain which exposes content that pertains to WM in general and is not tied to a specific project. If you compare the docs for wikimedia.org and en.wikipedia.org you'll notice that the endpoints are actually different.
>
> So, it's not that wikimedia.org should redirect to www.wikimedia.org but rather that RESTBase support for www.wikimedia.org is missing.

wikimedia.org and www.wikimedia.org are the same. www.wikimedia.org is canonical.

When setting up RESTBase, the entrypoint was (presumably by accident) wrongly baptised as "wikimedia.org". RESTBase is the only thing that uses wikimedia.org instead of www.wikimedia.org.

I don't know if this can be easily renamed, since it'd be preferable to have wikimedia.org be an alias to www.wikimedia.org, not the other way around. Either way, aside from existing users who have made it through the forest, new users following our documentation and expected entry points will continue to be stranded on the 404 error at https://www.wikimedia.org/api/rest_v1/?doc.

@Krinkle, at the time IIRC we wanted to allow for the possibility that www.wikimedia.org could become a wiki in its own right at some point, in which case it would also have its own REST API specific to that project. This would then conflict with the special wikimedia.org API, which has data that is global across all Wikimedia projects.

If we are sure that www.wikimedia.org won't ever be a wiki, then we can just set up the alias.

Yeah, I think we can be fairly sure of that. If not, then we'll have to resolve T138848 by normalising /api on wikimedia.org in the other direction instead. Which would leave a weird inconsistent state where non-api access redirects to www, but api access normalises to non-www.

Another way to prevent this theoretical conflict is to not use the same entry path for the special API as the per-wiki API. For example, it could live at www.wikimedia.org/global/rest/v1, www.wikimedia.org/global/api/rest_v1, or something else that isn't /api/rest_v1. And we'd have an alias to this only from non-www /api/rest_v1 for back-compat.

Okay, I'm on board with just configuring www.wikimedia.org/api/ and www.wikimedia.org/api/rest_v1/ as aliases for their wikimedia.org equivalents. To avoid confusion from different URLs floating around, I would propose to redirect from www.wikimedia.org/api/* to wikimedia.org/api/*. Reasons:

  • Solves the discovery problem.
  • Sends all consumers to the current, canonical wikimedia.org domain.
  • If a wiki is set up at www.wikimedia.org later, we can just drop the redirect. This should not break any existing clients pointed at wikimedia.org.
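
The proposed redirect rule, combined with the existing general host redirect, can be sketched as a single routing decision. This is a Python illustration only, not the Varnish or Apache configuration: `route` is a hypothetical name, the 301 status is an assumption, and the sketch assumes wikimedia.org/api/* would be served directly so that the two redirects cannot loop.

```python
from typing import Optional, Tuple


def route(host: str, path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect target); (200, None) means 'serve directly'."""
    is_api = path == "/api" or path.startswith("/api/")

    if host == "www.wikimedia.org" and is_api:
        # Proposed: API requests on www are redirected to the bare domain.
        return 301, f"https://wikimedia.org{path}"

    if host == "wikimedia.org" and not is_api:
        # Existing behaviour: everything else on the bare domain goes to www.
        return 301, f"https://www.wikimedia.org{path}"

    # wikimedia.org/api/* and the www portal pages are served as-is.
    return 200, None


if __name__ == "__main__":
    for host, path in [("www.wikimedia.org", "/api/rest_v1/"),
                       ("wikimedia.org", "/api/rest_v1/"),
                       ("wikimedia.org", "/"),
                       ("www.wikimedia.org", "/")]:
        print(host, path, route(host, path))
```

Dropping the first branch later, if a wiki is ever set up at www.wikimedia.org, leaves clients that already point at wikimedia.org unaffected.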

@Krinkle, @mobrovac, @Pchelolo: Does this sound like a good course of action to you? If there are no objections, I will look into setting this up next week.

+1 on setting up the www.wm.org -> wm.org redirect in Varnish.

Unfortunately, setting up these redirects for /api/ only turns out to be a bit more involved than I hoped. wikimedia.org is redirected to www.wikimedia.org through a high-priority redirect auto-generated from redirects.dat. The only exception to this is /api/rest_v1/, which is handled at the Varnish level for both www.wikimedia.org (not found in RESTBase), and wikimedia.org.

Short of changing the general host redirect from wikimedia.org to www.wikimedia.org, this basically means that setting up a redirect from www.wikimedia.org/api/* to wikimedia.org/api/* would need to happen at the Varnish level. There are precedents for mobile domain redirects, but they are quite tricky, and involve the use of fake "666" internal status codes. Technically this is clearly possible, but it does feel a bit hacky.

Overall, I think I would prefer a solution that redirects consistently in one direction. As far as I can tell, the static www.wikimedia.org project portal could just as well live at wikimedia.org. If we set up a redirect from www.wikimedia.org to just wikimedia.org, and adjusted the few direct references to static resources, then we would be consistent between API & non-API use, and we would also avoid adding more redirect logic in Varnish. We would also keep the references to www.wikimedia.org to a minimum, which could come in handy if for some reason we decided to make that a top-level project in its own right again.

Do you see any gotchas in making wikimedia.org the canonical domain? What is your overall take on that solution?

To summarize the options using a single domain only:

Use www.wikimedia.org only

Pros

  • Follows the common www. convention.
  • Established for the portal site.

Cons

  • Impacts existing API clients using wikimedia.org
    • Redirect: Establishes clear primary, but slows API clients until updated. Don't control most API clients.
    • Rewrite in Varnish: Avoids impact on existing clients, but confuses users by exposing the same data through different domains.
  • API would break if we ever convert www.wikimedia.org back to a wiki project.

Use wikimedia.org only

Pros

  • Arguably more logical to use the bare domain for cross-project data (compared to *.wikimedia.org projects).
  • No impact on existing API clients.
  • Does not break if we reintroduce a wiki at www.wikimedia.org at a later point.

Cons

  • Does not follow www. convention.
  • Slows responses for users hitting www.wikimedia.org.

I think it's fair to at least put forward arguments for a 3rd point of view as well:

Use a distinct hostname like rest.wikimedia.org for the cross-wiki API entrypoint of the REST API

Pros

  • Completely evades the issues around www.wikimedia.org vs wikimedia.org
  • API would not break if we ever convert www.wikimedia.org back to a wiki project (or anything else we do with the root and/or www hostnames there)
  • wikimedia.org is, in a logical sense, meta to more than just cross-wiki or even overall "application"-layer issues. It is also the organizational domain name for a wide variety of projects and services which are completely unrelated and will never be involved in wiki-centric APIs (e.g. phabricator, grafana, gerrit, etc.). It doesn't make sense to host a wiki at the root or www hostnames there (we don't, there is just a redirect and a generic portal page), and similarly it doesn't make sense to host a cross-wiki API there.

Cons

  • As with www.wikimedia.org: Impacts existing API clients using wikimedia.org (at least, until the old entrypoint is deprecated and then removed):
    • Redirect: Establishes clear primary, but slows API clients until updated. Don't control most API clients.
    • Rewrite in Varnish: Avoids impact on existing clients, but confuses users by exposing the same data through different domains.

To @BBlack's concern about scoping on projects vs. all *.wikimedia.org domains:

So far, both the portal at www.wikimedia.org, as well as the APIs at wikimedia.org have been focused on top-level projects. This seems to have worked reasonably well in practice. I am not aware of actual confusion or complaints from users about internal projects like phabricator, grafana or gerrit missing from the portal and APIs.

If we still feel that we should make the focus on projects clearer, then a domain that is more explicit about scoping, such as projects.wikimedia.org, might work better. The older rest.wikimedia.org on the other hand does not seem to be any clearer than wikimedia.org on what it would include or not include, and also comes with historical baggage. It also suggests a restriction to the REST API, which is not intended.

@Krinkle, I don't know the exact time period www.wikimedia.org was active as a wiki, and yes it was the foundation wiki. IIRC I mainly found the redirect in the apache config, which might well be a relic from the last decade.

@GWicke At which point was wikimedia.org (or www.wikimedia.org?) a wiki? Assuming this would've been the foundation wiki (currently wikimediafoundation.org), I traced it back using Archive.org to September 2004.

https://web.archive.org/web/20040923013337/http://wikimediafoundation.org:80/wiki/Home

Captures for wikimedia.org and www.wikimedia.org from before (in 2003), during (in 2004), and since then (2005 – now) all show only the portal page.

https://web.archive.org/web/20030727071930/http://wikimedia.org:80/
https://web.archive.org/web/20030724173454/http://www.wikimedia.org:80/
https://web.archive.org/web/20040610074601/http://wikimedia.org:80/
https://web.archive.org/web/20031225095240/http://www.wikimedia.org:80/

@GWicke Regarding T138848, note that there are two separate problems imho. I don't mind them being solved at the same time, but:

  1. The issue of consistency. (www.wikimedia.org is canonical, except for the /api/rest_v1/ endpoint for which non-www is canonical, and www doesn't work.)
  2. The issue of discoverability. wikimedia.org/api/ redirects to www.wikimedia.org/api/, which advertises the non-existent endpoint https://www.wikimedia.org/api/rest_v1/?doc (it also advertises an impossible Action API, similarly 404, and another 404, naturally, for the rest_v1 stylesheet). – I guess we just shouldn't expose this API page on that domain, and instead redirect to rest_v1 directly.

The other www-portals have neither an Action API nor a RESTBase API, but they also don't have /api/ (it's either a 404, or it redirects to a wiki, like www.wikipedia.org/api/ redirecting to en.wikipedia.org/api/).

@Krinkle: Agreed that there are some subtle differences between the tasks. I mainly merged them since the discussion here has converged on either using one domain for both the portal & global API, or using a different domain altogether. Both solutions would fix the other task as well.

> To summarize the options using a single domain only:
>
> Use www.wikimedia.org only
>
> Pros
>
>   • Follows the common www. convention.
>   • Established for the portal site.
>
> Cons
>
>   • Impacts existing API clients using wikimedia.org
>     • Redirect: Establishes clear primary, but slows API clients until updated. Don't control most API clients.
>     • Rewrite in Varnish: Avoids impact on existing clients, but confuses users by exposing the same data through different domains.

I'd recommend the latter, but not indefinitely. We'd deprecate REST on wikimedia.org, but it will work indefinitely as a redirect (naturally, given the domain's default behaviour). In the interim (e.g. 6 months to a year) we can maintain the silent rewrite as a performance optimisation until most users have migrated. A rel=canonical Link header and a deprecation announcement would help.
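
As a rough illustration of the interim state described above, the silent rewrite plus a rel=canonical Link header could look like the following Python sketch. It is not Varnish configuration; `rewrite` and the constant names are hypothetical, and only the header syntax (`Link: <url>; rel="canonical"`) is standard.

```python
from typing import Dict, Tuple

CANONICAL_HOST = "www.wikimedia.org"
LEGACY_HOST = "wikimedia.org"
REST_PREFIX = "/api/rest_v1/"


def rewrite(host: str, path: str) -> Tuple[str, Dict[str, str]]:
    """Return (host to serve the request as, extra response headers).

    `path` may include a query string, e.g. "/api/rest_v1/?doc".
    """
    if host == LEGACY_HOST and path.startswith(REST_PREFIX):
        # Serve legacy requests without a redirect, but tell clients
        # where the canonical copy of the resource lives.
        canonical_url = f"https://{CANONICAL_HOST}{path}"
        return CANONICAL_HOST, {"Link": f'<{canonical_url}>; rel="canonical"'}
    return host, {}


if __name__ == "__main__":
    print(rewrite("wikimedia.org", "/api/rest_v1/?doc"))
    print(rewrite("www.wikimedia.org", "/api/rest_v1/?doc"))
```

Once most clients have migrated, the rewrite branch would simply be removed and the domain's default redirect would take over.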

This sounds reasonable to me. Any objections against going with www.wikimedia.org, and migrating gradually as proposed by @Krinkle?

+1 from the client-side perspective. On the RESTBase side, though, we have to figure out how to do the transition (for all of our environments as well as for 3rd-party installs) given the fact that we use wm.org as a global domain internally. Simply renaming it in the config file is not an option as it would cause too much disruption. We could have an explicit, internal rewrite in RESTBase: www.wikimedia.org -> wikimedia.org

I agree that there are quite a few places that would need updating, but on the other hand doing a search & replace is not *that* hard. The bigger question I see though is about preserving data stored at wikimedia.org, especially math. The migration to Cassandra 3 could provide a natural opportunity for migrating data, perhaps with a temporary fall-back read scheme similar to the one we use for the HTML content migration.

Given a choice between adding rewrites & maintaining the legacy domain going forward, and fully migrating things now, I would propose to go for the latter. The cost of doing so seems quite reasonable now, especially when done as part of the Cassandra 3 migration.
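
A temporary fall-back read scheme of the kind mentioned above might look roughly like this. It is a minimal sketch with plain dictionaries standing in for storage; it does not describe how RESTBase or the HTML content migration actually work, all names are hypothetical, and copying the value forward on a fall-back hit is an added assumption.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (domain, item key)

NEW_DOMAIN = "www.wikimedia.org"
OLD_DOMAIN = "wikimedia.org"


def read_with_fallback(new_store: Dict[Key, bytes],
                       old_store: Dict[Key, bytes],
                       item: str) -> Optional[bytes]:
    """Read from the new domain first, then fall back to the legacy one."""
    value = new_store.get((NEW_DOMAIN, item))
    if value is None:
        value = old_store.get((OLD_DOMAIN, item))
        if value is not None:
            # Assumption: copy forward so later reads hit the new location.
            new_store[(NEW_DOMAIN, item)] = value
    return value


if __name__ == "__main__":
    old = {(OLD_DOMAIN, "formula-1"): b"<svg/>"}
    new: Dict[Key, bytes] = {}
    print(read_with_fallback(new, old, "formula-1"))  # found via fall-back
    print((NEW_DOMAIN, "formula-1") in new)           # now copied forward
```

Once the old store no longer receives fall-back hits, the legacy branch could be removed.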

@Krinkle - Is this ticket still worth pursuing at all?

Yes, insofar as the primary RESTBase URL is still completely broken if accessed through the canonical version of that domain name as we advertise it.

https://wikimedia.org/ -> https://www.wikimedia.org/ -> 200 OK
https://wikimedia.org/api/ -> https://www.wikimedia.org/api/ -> 200 OK

This page links to https://www.wikimedia.org/api/rest_v1/?doc which is a 404. Ideally this would be fixed by one of these:

  • Change RESTBase by "just" making its internal name match the canonical name instead of using a non-standard name like it does now.
  • Or, by making wikimedia.org our only root domain that uses non-www as its canonical form, and thus making the non-API part of it canonical on non-www as well.
  • Or, by changing all our www-portals to canonicalize on non-www.
  • Or, by changing RESTBase to support both and redirect accordingly. It would be odd to have part of the site redirect toward www and part of it away from www, but at least it would reflect current reality and make things work. I don't know if RESTBase supports such an alias concept, though.
  • Or, by changing Varnish to transparently rewrite www/api as non-www, which would mean both just work without redirects. Perhaps that is slightly less confusing, but not by much :)

From the math perspective, the change to the new MW REST API is already implemented but not yet reviewed. Thereafter, RESTBase will no longer be used by Extension:Math.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!