Page MenuHomePhabricator

Flatten internal URL structure and move away from a layered overloading model
Closed, ResolvedPublic

Description

With the potential for multiple handlers to overload a URL, it can be difficult to determine where a request will be routed internally. We currently use a rule of forwarding a request to the next routing layer if the URL did not change, but this is pretty subtle and not terribly intuitive.

In cases when we know for certain where we want a request to be handled (e.g. at the storage layer, or by a service handler, etc.) it helps to be able to specify a URL that will trigger that handler explicitly.

For this task, let's identify handlers that can be pulled out of the overloaded namespaces, and assign them more intentional URL patterns. Those can then also be used directly in error messages and metrics.

Examples

  • /{domain}/sys/parsoid/: parsoid-managed content: html, wikitext, data-parsoid, data-mw
  • /{domain}/sys/page_revision/: page and revision metadata
  • /{domain}/sys/table/: table storage layer; currently implicitly handled in storage.js

Request routing

With this in place, we can syntactically prevent external requests from hitting internal entry points. All access will instead be mediated and documented by external handlers, which will in most situations forward requests to the bucket layer.

Related Objects

Event Timeline

Jdouglas raised the priority of this task from to Needs Triage.
Jdouglas triaged this task as Medium priority.
Jdouglas updated the task description. (Show Details)
Jdouglas added a project: RESTBase.
Jdouglas changed Security from none to None.
Jdouglas edited subscribers, added: GWicke; removed: Aklapper.
Jdouglas subscribed.
GWicke renamed this task from Make intentional low-level paths more explicit to Flatten internal URL structure and move away from a layered overloading model.Dec 6 2014, 12:24 AM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)

I started to throw some ideas into a branch:

Generally, I feel that the logic we have in the yaml test handler is already too much. The logical next step would be to add more functionality there to handle more complex scenarios, which could lead to some half-assed yaml-based programming language. I think it'll be better to leave code to code & focus the spec handlers on declarative support for the typical tasks ('80%'). With a module structure we should be able to add code in the /_sys/ namespace, which can then be used by the spec handlers where necessary for more complex things.

This is all not very worked out yet, but I think the direction is promising. The biggest questions I have right now are:

  • How can we make sure that some handling always happens during the access to some data? Example: revision deletion/suppression vs. direct access to tables and KV buckets; should we enforce 'can only accessed through handler X'?
  • Will the mental model of the request flow be manageable? What will the major pieces be?

I think it'll be useful to work through the typical use cases (especially page content and perhaps those for buckets) to see where the issues are.

Some progress:

There is a new global config structure in config.example.yaml. I separated interfaces (swagger specs) from modules (implementations). Modules can be distributed as npm modules (or we can just have a folder full of them for now). The operationId property in the swagger spec can be used to match swagger routes to module exports. There are per-module options and resources. Options are passed to the module constructor. Resources are requests that are done after all modules are registered in the global namespace (get / put). A body can be specified, in which case it is compared with the returned body & sent if that doesn't match.

Mounting of interfaces vs. knowledge of global (absolute) paths

The tree-based router supports mounting of specs somewhere further down the tree, for example at /en.wikipedia.org/v1/page. This lets us use the same spec for different domains, or under different per-domain API versions (v2 for example).

Since the specs themselves don't really know how they will be mounted, it becomes difficult to use absolute paths to access other APIs from internal specs and modules. Even the absolute path to a /sys/ hierarchy is not necessarily known unless we make some assumptions about the structure of the tree.

Since we'd like to avoid direct external access to the sys namespace, an option would be to use a separate 'protocol' namespace, as in sys://parsoid/... The question then becomes how parameters like the domain are injected or handled. We kind of assume that we have a domain, so it might not be worth hiding it from the path.

The other option is to just define the canonical system namespace to live at /{domain}/sys/, and to always require the domain prefix. Versioning of the sys namespace can happen at the module level, so either /{domain}/sys/parsoid/v1/, or /{domain}/sys/parsoid2/.

Where to handle rev content read / write

One thing I was grappling with while fleshing out the content interface is how to organize the interaction between front-end handlers and system modules, especially for more complex cases like Parsoid.

In rev_content

  • needs to know about all backend buckets
  • can pull uniform content from services, return part
  • harder to understand for service author
  • harder to add storage in second step

In service: parsoid, simpleservice

  • can (optionally) configure backend storage for each service in simpleservice module
  • service module knows how to read / save its own content
  • easier mental model for service authors
  • can do more complex read processing if needed
  • page_revision service roles:
    • get latest revision for title (from api or local table)
    • get timeuuid range & metadata for revision
    • list pages & revisions (can expose this directly from front-end)
  • bock service roles:
    • blocks for url / revision

James, Marko & I discussed this earlier IRL.

Decisions:

  • Standardize on /{domain}/ prefix and /{domain}/sys/for internal resources.
  • Perform access to content through service wrappers. In particular, the parsoid module at /{domain}/sys/parsoid/ will own storage for its content types, and transparently call out to the actual parsoid service. It will in turn use page/revision metadata from /{domain}/sys/page_revision/.

Action items:

  • James continues work on T75955
  • Marko tackles swagger-router improvements:
  • Gabriel continues work on T76986 and T78194, and starts work on loading of interfaces / modules using the new design

Basic spec loading with sharing has now landed on [the modules branch](https://github.com/wikimedia/restbase/tree/modules) in ae15c7e8a

Next steps:

  • Flesh out sys interfaces - in progress
  • Add actual module loading and symbol matching: partly done, async work in T86875
  • Convert some minimal set of modules to the new-style API - in progress
  • Add real sub-request ('service') handling: T86737
  • Hook the entire thing up in restbase.js - in progress
  • Protect /sys - done
  • Add routing tests - only some right now; will get lots of implicit testing with RB integration tests
  • Get tests to pass

In today's team meeting we determined that we'll make a decision on whether we are going ahead with this refactor before the deploy by Friday. A major reason for this refactor is to clean up the code structure in preparation for third-party extensions and new entry points. It also properly addresses several release-related issues in a way that we feel is solid in the longer term.

Plan B is to do something simpler for the release, and continue with the refactor afterwards.

GWicke raised the priority of this task from Medium to High.

https://github.com/wikimedia/restbase/pull/118 has now landed, which resolves this task.

RESTBase is now fully spec-driven, auto-creates storage tables & supports spec-derived listings.