This document is an attempt to formalize the output of the "Architecting Core: Standalone Services" session from the 2018 Wikimedia Technical Conference (see https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_notes/Architecting_Core:_stand-alone_services and T206082).
The text is also on wiki at: <https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_external_services>, but discussion should continue on Phabricator at this point.
The following proposals aim to set architectural principles and standards for when, how and why to extend MediaWiki via an external service for use in the WMF production environment.
== Definition ==
''Standalone services'' are applications that extend MediaWiki's functionality, but that are operationally distinct in some meaningful way. The mechanism of extension is not specified, beyond saying that it's necessarily interprocess as opposed to intraprocess; queues, API calls and other types of RPC mechanisms, and XHR from a client are all valid examples, while shelling out from MediaWiki itself is not. It is important that the core business logic is implemented in the service, rather than within MediaWiki itself.
== Selection Criteria: Deciding whether an external service is appropriate ==
The properties listed here are intended to be a guide as to whether a given feature can be provided externally to MediaWiki or not. They are intended to be necessary, but not sufficient. If a proposed feature has one or more properties that appear in the left column, but no properties that appear in the right column, then that feature could be implemented as a standalone service. If the proposed feature has one or more properties that appear in the right column, then a wide consensus must be reached before implementing it as a standalone service.
| Properties that make a feature ''suitable'' for implementation as a standalone service | Properties that make a feature ''unsuitable'' for implementation as a standalone service |
|State is independent: the functionality provided does not require a view of MediaWiki state that is guaranteed to be current. | Feature only works correctly with a consistent and current view of MediaWiki state.|
|3rd party library or service exists that can provide the needed functionality with minimal integration| Feature requires direct access to the MediaWiki database, and cannot use an API to retrieve or update data.|
|A non-PHP language or framework exists that significantly simplifies implementation | Requires features/functionality provided by other MediaWiki extensions that are implemented internally|
|Functionality degrades gracefully if the external service is unavailable |Unavailability of the external service compromises the general availability of the site to the user (e.g. results in a MediaWiki fatal error)|
|The feature is independently useful, and is likely to have non-MediaWiki use cases| The feature involves directly parsing wikitext, or accessing MediaWiki i18n messages (also wikitext)|
If any of the following are true, then the feature '''absolutely should be implemented as an external service''', with appropriate architectural changes made elsewhere to eliminate disqualifying properties.
|Properties that ''require'' that a service is provided externally to MediaWiki|
|Elevated security need: Due to data isolation or other operational requirements, a given feature cannot be provided in the same operational environment as MediaWiki itself.|
|Excessive or potentially unbounded resource needs: Image thumbnailing, video transcoding, and machine translation are all examples of features where unpredictable properties such as request rates and input size have a significant impact on the resources required and, because of factors that the operator can't control, may result in resource contention and denial of service.|
|Long-running processes involved.|
|The feature in question is used to triage or fix MediaWiki in the case of failures|
|The application is going to be run in a separate environment from MediaWiki itself|
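One way to read the rules above is as a small decision function. The sketch below is illustrative only: the `Verdict` names and the `assess` function are hypothetical, and the `NO_CASE` outcome for a feature that matches no property at all is an assumption, since the text does not cover that case. Because the listed properties are necessary but not sufficient, `MAY_BE_EXTERNAL` is a starting point for discussion, not an approval.

```python
from enum import Enum


class Verdict(Enum):
    MUST_BE_EXTERNAL = "should absolutely be implemented as an external service"
    MAY_BE_EXTERNAL = "could be implemented as a standalone service"
    NEEDS_CONSENSUS = "wide consensus required before going standalone"
    NO_CASE = "no listed property applies (assumed outcome, not in the text)"


def assess(suitable: int, unsuitable: int, required: int) -> Verdict:
    """Apply the selection criteria to counts of matching properties.

    suitable   -- matches in the "suitable" column of the first table
    unsuitable -- matches in the "unsuitable" column of the first table
    required   -- matches in the table of properties that require a service
    """
    if required > 0:
        # Any disqualifying properties must then be eliminated through
        # architectural changes elsewhere.
        return Verdict.MUST_BE_EXTERNAL
    if unsuitable > 0:
        return Verdict.NEEDS_CONSENSUS
    if suitable > 0:
        return Verdict.MAY_BE_EXTERNAL
    return Verdict.NO_CASE
```

For example, a feature with an elevated security need and a dependency on current MediaWiki state yields `assess(0, 1, 1)`, which returns `Verdict.MUST_BE_EXTERNAL`: the requirement wins, and the state dependency has to be designed away.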
Given that MediaWiki is not just part of the Wikimedia production infrastructure but also software used by many third parties, we can't just delegate some of its fundamental functions to an external service completely. Whenever an external service provides a functionality, we also need to ask ourselves whether that functionality is fundamental or optional to MediaWiki. If the functionality replaced by the external service is fundamental, a fallback solution must be present within MediaWiki that substitutes what is being implemented in the service. As an example: for Wikimedia purposes, async processing for MediaWiki is implemented via a series of external services; for simple installations, a MySQL-backed version of the same mechanism exists and cannot be dropped from MediaWiki. What constitutes a core functionality of MediaWiki and what doesn't will need to be further defined elsewhere, and is beyond the scope of this document.
== Architectural Guidelines for external services ==
This document only deals with principles (e.g. an application needs to be observable and expose appropriate metrics) and not with implementation guidelines. Practical implementation guidelines will be written to detail how the principles enumerated in this document are to be applied in technical terms (e.g. the application must expose RED metrics from a <tt>/metrics</tt> endpoint in Prometheus format, with a precise naming convention). The reason to split the two parts is that we don't expect the principles to change much over time, while we do expect the implementation guidelines to change more quickly due to technical evolution.
There are various aspects of the development and usage cycle of a new service, and several of those need to be as standardized as possible across the board so that the complexity of our ecosystem does not become unmaintainable. In general, adopting a non-monolithic architecture has its costs, and those costs become hard to contain unless standards are maintained regarding how different applications interoperate and how they're developed.
There are several aspects of the development of a service that need to be taken into account:
* Development policies
* Security/privacy requirements
* Production deployment
In the next few sections we analyze the requirements a new service must fulfill in each of those categories.
=== Development policies ===
Everything we develop should be free, open to collaboration and useful in itself. So, a new service must:
* Actually do something
* Provide functionality that no existing FLOSS software already provides
* Avoid needlessly duplicating features or functionality provided in other services
* Be licensed under an OSI-approved license
* Provide a configuration mechanism that does not involve changing the distributed code
* Use a language and toolset that have been approved by TechCom
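As an illustration of the configuration requirement, the following minimal sketch reads settings from the process environment, so that operators never edit the distributed code. Everything here is hypothetical: the `Settings` fields, the `EXAMPLE_*` variable names and the default values are assumptions made for the example, not part of any WMF guideline.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    listen_port: int
    mediawiki_api_url: str


def load_settings(env=None) -> Settings:
    """Build settings from environment variables, falling back to defaults.

    `env` can be any mapping; it defaults to the real process environment.
    """
    env = os.environ if env is None else env
    return Settings(
        # Hypothetical variable names, chosen for this sketch only.
        listen_port=int(env.get("EXAMPLE_LISTEN_PORT", "8080")),
        mediawiki_api_url=env.get("EXAMPLE_MW_API_URL", "http://localhost/w/api.php"),
    )
```

Passing the mapping explicitly, as in `load_settings({"EXAMPLE_LISTEN_PORT": "9000"})`, keeps the function easy to test without touching the real environment.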
While some of our services will only be useful in the WMF context, in other cases the standalone service is intended to be distributed for general use. In that case, it must have the following properties:
* Have a documented installation and uninstallation process that conforms to our implementation guidelines
* Have a documented upgrade process that conforms to our implementation guidelines
* Be versioned using semver
* Indicate versions of MediaWiki with which it's compatible
* Provide a mechanism by which support (community or otherwise) can be requested
* Provide a mechanism by which patches can be proposed
* Provide a mechanism by which public security advisories are issued
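The versioning and compatibility items above could be made machine-checkable. The sketch below is a simplified illustration under stated assumptions: it handles only plain `MAJOR.MINOR.PATCH` strings (no pre-release or build metadata, which full semver allows), and the function names and the idea of declaring compatibility as a half-open version range are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string into a comparable tuple."""
    major, minor, patch = text.split(".")
    return int(major), int(minor), int(patch)


def is_compatible(mediawiki_version: str, minimum: str, below: str) -> bool:
    """Return True if minimum <= mediawiki_version < below."""
    return parse_version(minimum) <= parse_version(mediawiki_version) < parse_version(below)
```

Comparing integer tuples rather than strings matters: as text, `"1.9.0"` sorts after `"1.10.0"`, while the parsed tuples order correctly.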
=== Security and privacy ===
All features implemented as standalone services must have the following properties:
* Minimize data collection for any type of PII
* Be compliant with the WMF privacy/data retention policies.
* Implement privacy controls that are ''at least'' equivalent to those of any calling service. For example, if the privacy controls of the calling service specify that IP addresses will not be stored for more than 90 days, the external service may not store IP addresses for longer than that time.
* Have passed a security review
* Have resources allocated so that a prompt response to any security incident is possible
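The "at least equivalent" rule for privacy controls can be illustrated with the retention example given above. In this minimal sketch, the function name is hypothetical and retention periods are modelled as `timedelta` values, which is an assumption of the example: a service may keep data no longer than the strictest limit among itself and its callers.

```python
from datetime import timedelta
from typing import Iterable


def allowed_retention(service_limit: timedelta, caller_limits: Iterable[timedelta]) -> timedelta:
    """Return the longest retention period the external service may apply.

    The result is the shortest of the service's own limit and the limits
    of every calling service.
    """
    return min([service_limit, *caller_limits])
```

If a service would by default keep IP addresses for 180 days but is called by a service with a 90-day limit, `allowed_retention(timedelta(days=180), [timedelta(days=90)])` returns 90 days.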
=== Production deployment ===
If the standalone service is intended to be used in the Wikimedia production environment, it should comply with the guidelines above, and in addition must:
* Be deployable with standard WMF tooling (as specified in the implementation guidelines)
* Have an owner, and a plan for ongoing maintenance. If the owner of a service is missing (because the team is disbanded or has a different focus), a new owner must be found via the code stewardship process
* Have logging that conforms to the WMF standards - specified in the service implementation guidelines
* Collect [[ https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/ | RED ]] metrics; be able to export those metrics according to WMF standards specified in the implementation guidelines
* Have a [[https://en.wikipedia.org/wiki/Runbook | runbook]] for operational purposes
* Support a multi-datacenter active-active (or active-passive) deployment
* [[ https://en.wikipedia.org/wiki/Service_level_indicator | Service Level Indicators]] must be defined for the service, and [[ https://en.wikipedia.org/wiki/Service-level_objective | Service Level Objectives]] should be agreed upon. Failure to meet said Service Level Objectives SHOULD result in actions aimed at getting back on track. The Service Level Objectives can of course be reevaluated and changed, but preferably as the result of an informed process rather than in reaction to a violation
* Have pinned / pinnable dependencies that don't need to be downloaded at runtime and/or from an untrusted source
* Have backups if the service stores any data
* Have users, or a plan to acquire users
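To make the RED metrics requirement concrete, here is a minimal sketch of what is being measured: Rate (requests per second), Errors (failed requests) and Duration (time taken per request). The `RedMetrics` class and its field names are hypothetical, and the sketch deliberately ignores the export format; a real service would expose these values as specified in the implementation guidelines, and would normally track durations as a histogram rather than a plain list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedMetrics:
    """Accumulates Rate, Errors and Duration figures for one service."""

    requests: int = 0
    errors: int = 0
    durations: List[float] = field(default_factory=list)

    def observe(self, duration_seconds: float, failed: bool) -> None:
        """Record one handled request."""
        self.requests += 1
        if failed:
            self.errors += 1
        self.durations.append(duration_seconds)

    def summary(self, window_seconds: float) -> dict:
        """Summarize the observations collected over `window_seconds`."""
        count = len(self.durations)
        return {
            "rate_per_second": self.requests / window_seconds,
            "error_ratio": self.errors / self.requests if self.requests else 0.0,
            "mean_duration_seconds": sum(self.durations) / count if count else 0.0,
        }
```

A request handler would call `observe()` once per request, and a periodic job or metrics endpoint would read `summary()`.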
==== Service - Service interaction ====
Services will likely interact with each other; if that is the case, measures must be taken so that the failure of a single component does not bring down the whole system. Increased observability of the flow of requests is also needed. So any new service that needs to be deployed in production should:
* Degrade its functionality gracefully if it can't access another service. If that's not possible, the new service should perhaps be logically tied to the other. An exception is explicitly made for the MediaWiki API, given that quite a few services might depend on its availability to be useful.
* Implement [[https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern |circuit breaking]] and backoff in case the server responds with an error or times out
* Include concurrency limiting policies on all requests (server-side)
* Add the appropriate tracing headers to the request, according to the WMF standards specified in the implementation guidelines
* Log all requests received via the production logging facilities
* Be able to perform requests via TLS to a specific hostname/IP provided via configuration
* Provide telemetry information about requests performed to other services, following the implementation guidelines.
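The first two items of the list above (graceful degradation, and circuit breaking with backoff) can be sketched together. This is a minimal, single-threaded illustration and not a recommended implementation: the class and function names, the thresholds and the delays are all hypothetical, and a production service would more likely rely on an established library or on the platform's service proxy.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load on a struggling service.
                raise CircuitOpenError("downstream service marked unavailable")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff delay in seconds for a 0-based retry attempt."""
    return min(cap, base * (2 ** attempt))


def fetch_with_fallback(breaker, fetch, fallback):
    """Degrade gracefully: return `fallback` if the dependency is unavailable."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

The injectable `clock` exists so that the cool-down can be tested without sleeping. Callers that retry would wait `backoff_delay(attempt)` seconds between attempts, ideally with some random jitter added.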